We now move to data structures. In Chapter 3 the focus was on individual numbers, booleans, strings and date/times. Data structures hold multiple values. In R you can work with the following data structures:
vectors
factors
matrices
arrays
lists
data frames
tibbles
A vector is a one-dimensional data structure: it has one row and one or more columns. In case there is only one column, an R vector holds one number, one character value, one logical or one date/time value. In other words, in the previous chapter, we actually used vectors. A vector, like a matrix or an array, is homogeneous: it allows you to store only one type of value (e.g. numeric, character, …).
To create a vector we use the c() function to combine the elements within this function in one data structure. Let’s create a vector with numbers:
vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
vec_num
 [1] 0 1 1 2 3 5 8 13 21 34
You can see the total number of elements in this vector using the length() function:
length(vec_num)
[1] 10
Here, we have a total of 10 columns. In the environment pane the vector is listed as num [1:10]: you can read this as 1 row and 10 columns.
If you check the type of this vector, you’ll see that its type is “double”. In other words, it is a numeric vector.
typeof(vec_num)
[1] "double"
You can check if an object is a vector using is.vector():
is.vector(vec_num)
[1] TRUE
In addition, you can check if a vector is of a given type by including the mode (numeric, logical, …) in is.vector():
is.vector(vec_num, mode = "numeric")
[1] TRUE
The type of vec_num is “double”. You can create a vector with other data types:
vec_char <- c("cat", "mouse", "dog", "bird")
vec_log <- c(TRUE, FALSE, TRUE, TRUE)
vec_int <- c(1L, 10L, 50L)
vec_dat <- c(as.POSIXct("2025-03-25"), as.POSIXct("2025-04-25"))
You can check the type of the data stored in all these vectors:
typeof(vec_char)
[1] "character"
typeof(vec_log)
[1] "logical"
typeof(vec_int)
[1] "integer"
typeof(vec_dat)
[1] "double"
The type of a vector is the type common to all individual elements. In other words, a vector only holds elements of the same type. If this is not the case, R will change the type of all elements in the vector to a type that fits all. This is also called implicit coercion: R chooses the type for that data that fits all components of the data structure. For a vector, this means that all values in the columns will have the same type.
For instance, suppose that we have a vector
vec_1 <- c(1, "2", 3)
Here, we mix two numeric values with one character value “2”. If you take a look at this vector, you’ll see that R changes all elements into characters:
vec_1
[1] "1" "2" "3"
You can verify this by checking the type:
typeof(vec_1)
[1] "character"
As you can see, vec_1 is not a numeric vector, but a character vector. Let’s take another example:
vec_2 <- c(TRUE, FALSE, 5, as.POSIXct("2025-03-25"))
vec_2
[1] 1 0 5 1742857200
In this example, we have a mix of logical values (TRUE, FALSE), a numeric value and a date/time value. R uses a common type: it sets TRUE equal to 1, FALSE equal to 0 and shows the date/time as the number of seconds since January 1, 1970 (the exact value depends on your time zone). In other words, R implicitly coerces the vector into a double vector.
typeof(vec_2)
[1] "double"
Let’s see what happens if we mix logical, character and numeric values:
vec_3 <- c(TRUE, FALSE, "a", 5)
vec_3
[1] "TRUE" "FALSE" "a" "5"
Here, from the quotation marks, you can see that R changes all individual elements into character values. These three examples illustrate implicit coercion: R tries to find a way to represent the elements in a vector using a common type. Sometimes this implicit coercion makes sense, sometimes it doesn’t. For instance, combining a numeric value and a character representation of a numeric value creates a character vector. R changes numbers into characters because you can always represent a number as a character, while you cannot always represent a character as a number. In a similar way, you can always represent a logical value as a number, but a number can be represented as a logical value without loss only if it happens to be 0 or 1. This is why R sets the type of a vector that includes both logical and numeric values to numeric. The same holds for a mixture of date/time and logical, date/time and numeric, and date/time, logical and numeric values.
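The order in which R picks the common type can be sketched in a few lines. This is a minimal illustration, not part of the examples above; the names type_1, type_2 and type_3 are only used here:

```r
# Implicit coercion moves towards the most flexible type:
# logical -> integer -> double -> character
type_1 <- typeof(c(TRUE, 1L))             # logical + integer
type_2 <- typeof(c(TRUE, 1L, 2.5))        # ... + double
type_3 <- typeof(c(TRUE, 1L, 2.5, "a"))   # ... + character

type_1  # "integer"
type_2  # "double"
type_3  # "character"
```

Each time a more flexible type is added, the whole vector moves to that type.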
You can coerce the type of a vector using an as.*() function: as.numeric(), as.integer(), as.character(), as.logical(), as.Date() or as.POSIXct(). Here, the coercion is explicit. R will try to change all elements into the requested type. Where this is impossible, R produces NA. For instance, let’s try to change the three vectors vec_1, vec_2 and vec_3 into numeric vectors:
as.numeric(vec_1)
[1] 1 2 3
as.numeric(vec_2)
[1] 1 0 5 1742857200
as.numeric(vec_3)
Warning: NAs introduced by coercion
[1] NA NA NA 5
For vec_1 and vec_2, R could change all the elements into type numeric: in vec_1, R managed to change the character “2” into the number 2. The same holds for vec_2, which already was a double vector: when it was created, R changed TRUE and FALSE into 1 and 0 and stored the date/time value in numeric format. For vec_3, changing all elements into numeric was impossible. As a matter of fact, with the exception of the number 5, R didn’t manage to change the type at all. Why couldn’t R change “TRUE” or “FALSE” into 1 and 0 as it could in vec_2? Here, “TRUE” and “FALSE” were character values, not booleans. When vec_3 was created, R changed the type of all its values into “character”. As far as R is concerned, TRUE became “TRUE”, and R doesn’t keep track of the path that led to “TRUE”: it doesn’t recall changing TRUE into “TRUE”. Because of this, R didn’t manage to change “TRUE” back into a boolean TRUE and from there into a number. As this was not possible, it replaced that value with NA.
You can change an object (e.g. a column in a data frame) into a vector using the as.vector() function. This function takes two arguments: the object that you want to convert into a vector and the mode (type) of the vector. For instance
vec_4 <- as.vector(vec_1, mode = "numeric")
vec_4
[1] 1 2 3
creates a numeric vector from vec_1. Note that this operation was not that useful here: vec_1 already is a vector (of type character), so as.numeric(vec_1) would have given the same result. However, in later chapters we will convert variables or columns in a data frame into vectors. To do so, we will often have to be explicit about the mode. If you leave out the mode, R keeps the type of the object, e.g. the type of the column in the data frame.
So far, all vectors were created using c(), including all elements one by one in this function. Using the vector(mode, length) function, you can create an empty vector of a given type and length. For instance, to create an empty numeric vector of length 10:
vec_1 <- vector("numeric", length = 10)
vec_1
 [1] 0 0 0 0 0 0 0 0 0 0
As you can see, this vector is filled with 0. Note that this is a numeric vector but only for now. If you would change one of its elements into a character, the full vector would change from numeric into character. If you want to create an empty character vector:
vec_2 <- vector("character", length = 10)
vec_2
 [1] "" "" "" "" "" "" "" "" "" ""
Here, you can see that each element is an empty string "". Note that an empty string is not the same as a space: recall that a space is a character.
Creating a vector with “0” values can be very useful before a for loop. Suppose that you have a for loop where each iteration adds the result of a calculation to a vector. Here, you have two options. First, you allow the vector to ‘grow’ in every iteration. Second, you define an empty vector with the same length as the number of iterations and you fill each element as you run through the loop. The first option is not very efficient, as R copies the entire vector each time you expand it with one element. This is not the case if you create the vector before the loop. Here, R fills one element after the other but doesn’t need to grow the vector.
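The two options can be sketched as follows. This is a minimal illustration with a small number of iterations; the names n, squares_grow and squares_pre are only used here:

```r
n <- 5

# Option 1: the vector grows in every iteration (R copies it each time)
squares_grow <- c()
for (i in 1:n) {
  squares_grow <- c(squares_grow, i^2)
}

# Option 2: create the vector before the loop, then fill it element by element
squares_pre <- vector("numeric", length = n)
for (i in 1:n) {
  squares_pre[i] <- i^2
}

# Both options produce the same values
identical(squares_grow, squares_pre)
```

With 5 iterations you will not notice a difference in speed. The second option starts to pay off when the number of iterations is large.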
Recall that NA are missing observations. If a vector includes NA values, that will not change the vector’s type. To see this, let’s create two vectors, one numeric and one character, which both include NA and show their type:
vec_1 <- c(10, 30, NA, 40)
vec_2 <- c("dog", NA, "cat")
typeof(vec_1)
[1] "double"
typeof(vec_2)
[1] "character"
The same holds for NaN (not a number) and Inf (infinity) in a numeric vector. In a character vector, however, NaN and Inf become the character values “NaN” and “Inf”: they are considered characters and not special values. NA remains a missing value, as vec_2 shows.
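To see what happens to NaN and Inf, you can compare a numeric and a character vector. This is a small sketch; the names vec_special and vec_special_chr are only used here:

```r
# In a numeric vector, NA, NaN and Inf keep their special meaning
vec_special <- c(1, NA, NaN, Inf)
typeof(vec_special)          # "double"
is.infinite(vec_special)     # FALSE FALSE FALSE TRUE

# In a character vector, NaN and Inf are stored as ordinary text
vec_special_chr <- c("a", NaN, Inf)
vec_special_chr              # "a" "NaN" "Inf"
typeof(vec_special_chr)      # "character"
```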
Create a numeric vector with 5 columns, 1, 2, 3, 4 and 5. Assign this vector to vec_yt1.
vec_yt1 <- c(1, 2, 3, 4, 5)
Check the type of this vector.
typeof(vec_yt1)
[1] "double"
Create a new vector, vec_yt2, with values TRUE, FALSE, TRUE, TRUE, FALSE and check the class and type of this vector.
vec_yt2 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(vec_yt2)
[1] "logical"
typeof(vec_yt2)
[1] "logical"
Determine the length of the vector vec_yt1.
length(vec_yt1)
[1] 5
Create a character vector vec_yt3 whose elements include: south, west, east, north.
vec_yt3 <- c("south", "west", "east", "north")
Determine the length of this vector and the number of characters of each element.
length(vec_yt3)
[1] 4
nchar(vec_yt3)
[1] 5 4 4 5
Can you store the number of characters in a new vector vec_yt3n?
vec_yt3n <- nchar(vec_yt3)
You can define names for the columns of a vector. You can do so when you create the vector using the c() function or the setNames() function, or, at a later stage, using the names() function. Suppose that you have a vector with exam results for three courses, A, B and C. Using a named vector allows you to identify the columns:
vec_1 <- c(A = 15, B = 13, C = 17)
The vector now includes column names. You can see this in the environment pane, where vec_1 is now identified as Named num [1:3]. These names are also shown if you ask R to show the vector:
vec_1
 A  B  C
15 13 17
There are other ways to add names. Using setNames() you can define both the vector and the names. Using the previous example:
vec_2 <- setNames(c(15, 13, 17), c("A", "B", "C"))
In a final example, we’ll use the names() function to add names after the vector was created. Let’s first create a vector:
vec_3 <- c(15, 13, 17)
To add names, we include them in another vector and use names() to assign names to vec_3:
names(vec_3) <- c("A", "B", "C")
vec_3
 A  B  C
15 13 17
The names() function adds an attribute to the vector. To see this, let’s check the attributes of vec_3:
attributes(vec_3)
$names
[1] "A" "B" "C"
You can also use the names() function to extract the names of a vector:
var_names <- names(vec_1)
var_names
[1] "A" "B" "C"
Here, R checks the attributes of the vector vec_1 and copies the names of the variables to var_names. As an alternative, you could have done the same using
attributes(vec_1)$names
[1] "A" "B" "C"
Here, R reads the attributes of vec_1 and extracts the names of the columns.
Extracting the names allows you to store these names in a character vector that you can use in your work flow. With many columns, you can see the names using e.g. str():
str(vec_1)
 Named num [1:3] 15 13 17
 - attr(*, "names")= chr [1:3] "A" "B" "C"
Here, too, you can see that names are defined as an attribute.
To remove the names of columns, you can use unname(obj, force = FALSE). The first argument is the object (e.g. vector) whose names you want to remove; the second is a specific option to remove names even if the object is a data frame. You can usually keep the default value FALSE.
vec_3 <- unname(vec_3)
vec_3
[1] 15 13 17
For the vector vec_yt1 with elements 1, 2, 3: add names A, B and C to this vector. Do this in three ways.
vec_yt1 <- c(A = 1, B = 2, C = 3)
vec_yt1
A B C
1 2 3
vec_yt1 <- setNames(c(1, 2, 3), c("A", "B", "C"))
vec_yt1
A B C
1 2 3
vec_yt1 <- c(1, 2, 3)
vec_yt1
[1] 1 2 3
names(vec_yt1) <- c("A", "B", "C")
# Note that you can use setNames() as well
setNames(vec_yt1, c("A", "B", "C"))
A B C
1 2 3
vec_yt1
A B C
1 2 3
Check the attributes of vec_yt1:
attributes(vec_yt1)
$names
[1] "A" "B" "C"
Extract the names of vec_yt1 and store them in a vector vec_yt1_names.
# Option 1
vec_yt1_names <- names(vec_yt1)
# Option 2
vec_yt1_names <- attributes(vec_yt1)$names
Would the following code work to remove the names from vec_yt1? If not, how can you remove the names?
unname(vec_yt1)
attributes(vec_yt1)
Does it work? If you don’t think so, check:
vec_yt1 <- unname(vec_yt1)
attributes(vec_yt1)
NULL
Using rep(x, times, length.out, each) you can replicate the values in a vector x. The first argument, x, holds the values you want to replicate. This can be any value: a number, a character, a vector, … The other arguments determine how many times or how x is replicated. For instance, to create a vector with 10 columns and all values equal to 25:
vec_rep <- rep(x = 25, times = 10)
Here, x was a number, but you can also replicate characters or other vectors:
vec_rep_char <- rep(x = "ABC", times = 5)
vec_rep_vec <- rep(x = c(1, 2, 3), times = 5)
vec_rep_char
[1] "ABC" "ABC" "ABC" "ABC" "ABC"
vec_rep_vec
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
length.out sets the length of the output vector. If x is a single numeric, character or date/time value, using length.out and times is equivalent. If x is a vector, this is not the case. In the previous example, c(1, 2, 3) was replicated 5 times. In other words, the length of the output vector was 15. Using length.out you can set the total length. In doing so, R will replicate the vector, but will do so only partially on the last replication. For instance, if you set length.out = 10, the length of the output vector is 10:
vec_rep_vec <- rep(x = c(1, 2, 3), length.out = 10)
vec_rep_vec
 [1] 1 2 3 1 2 3 1 2 3 1
With times and length.out you replicate the full vector on every replication. Using each, you replicate every element of the vector each times: the output vector repeats the first element of the input vector each times before it moves to the second element, and so on.
vec_rep_vec <- rep(x = c(1, 2, 3), each = 3)
vec_rep_vec
[1] 1 1 1 2 2 2 3 3 3
Adding length.out sets a limit on the total length of the output vector. It does so by reducing the number of replications of the last element in the input vector:
vec_rep_vec <- rep(x = c(1, 2, 3), length.out = 7, each = 3)
vec_rep_vec
[1] 1 1 1 2 2 2 3
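To make the difference between times and each more visible, you can put them side by side. This is a small sketch; the names x, rep_times, rep_each and rep_both are only used here:

```r
x <- c(1, 2, 3)

rep_times <- rep(x, times = 2)            # the full vector, twice
rep_each  <- rep(x, each = 2)             # every element twice, in order
rep_both  <- rep(x, each = 2, times = 2)  # first each, then the result twice

rep_times  # 1 2 3 1 2 3
rep_each   # 1 1 2 2 3 3
rep_both   # 1 1 2 2 3 3 1 1 2 2 3 3
```

If you combine both arguments, R first applies each and then replicates the result times times.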
There are a number of functions that produce a vector. These can be grouped into functions that generate a sequence, functions that generate a vector with random numbers, functions that sample from another vector and functions that create a vector as the result of set operations.
To create a vector, we used c() and included all its values. Some functions allow you to create a special vector. seq() allows you to fill a vector with a sequence of numbers. To do so, this function requires a start point (from), an end point (to) and either the increment of the sequence (by) or the length of the sequence (length.out). If length.out is specified, R calculates the increment of the sequence. To see how this works, let’s create a vector which holds a sequence starting at 1, ending at 10 in steps of 1:
vec_1 <- seq(from = 1, to = 10, by = 1)
vec_1
 [1] 1 2 3 4 5 6 7 8 9 10
As an alternative, we can create the same sequence using the length.out argument:
vec_2 <- seq(1, 10, length.out = 10)
vec_2
 [1] 1 2 3 4 5 6 7 8 9 10
What happens if the sequence, starting from the value in from and incrementing with the value in by, doesn’t end exactly in the value given in the to argument? In that case, seq() stops at the last value before to. For instance:
vec_3 <- seq(1, 10, by = 8)
vec_3
[1] 1 9
Using the length.out argument, the sequence always ends in the value in to. That is so because R determines the increment by dividing the interval between from and to into length.out - 1 equally spaced steps. You can use len or length as an alternative for length.out. As an example:
vec_3 <- seq(1, 10, length.out = 25)
vec_3
 [1] 1.000 1.375 1.750 2.125 2.500 2.875 3.250 3.625 4.000 4.375
[11] 4.750 5.125 5.500 5.875 6.250 6.625 7.000 7.375 7.750 8.125
[21] 8.500 8.875 9.250 9.625 10.000
You can also use seq() with from, by and length.out. Here, you don’t specify the last value of the sequence. R will generate a sequence that starts from the value in from and adds the value in by until the sequence has length.out elements. Note that in this case, you need to name the arguments of the function as you skip the second argument.
vec_4 <- seq(from = 10, by = 10, length.out = 25)
vec_4
 [1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190
[20] 200 210 220 230 240 250
Note that the increment can be negative. In that case, R will reduce the start value by the value of the increment until it reaches the end value or until the sequence has the number of elements in length.out:
vec_5 <- seq(from = 100, to = 50, by = -10)
vec_5
[1] 100 90 80 70 60 50
or, as an alternative
vec_6 <- seq(from = 100, by = -10, length.out = 5)
vec_6
[1] 100 90 80 70 60
In specific cases, you can create a sequence using a shorter notation. For instance, suppose you want a vector of integers, where each increment is exactly 1. To generate this sequence, you can use
vec_7 <- 21:30
vec_7
 [1] 21 22 23 24 25 26 27 28 29 30
We will often use this short way of writing a sequence in a for loop:
i <- 1
for (i in 1:5) {
print("Hello World")
}
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
[1] "Hello World"
Here, i takes each value in 1:5 in turn, i.e. 1, 2, 3, 4 and 5, and R prints Hello World once for every value. The counter i starts with a value 1 and increases by 1 after every print of Hello World, until it has reached 5.
If the starting position is 1, seq_len() can be used as well:
vec_8 <- seq_len(10)
vec_8
 [1] 1 2 3 4 5 6 7 8 9 10
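One situation where seq_len() and the colon notation differ is a length of 0, which matters when the number of iterations in a for loop is not known in advance. The sketch below illustrates this; the names n, count_colon and count_seq are only used here:

```r
n <- 0

1:n          # 1 0        : the sequence counts down from 1 to 0
seq_len(n)   # integer(0) : an empty sequence

# A loop over 1:n runs twice, a loop over seq_len(n) does not run at all
count_colon <- 0
for (i in 1:n) count_colon <- count_colon + 1

count_seq <- 0
for (i in seq_len(n)) count_seq <- count_seq + 1

count_colon  # 2
count_seq    # 0
```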
You can use seq.Date() to generate a sequence of dates. The arguments of this function are very similar to those of the seq() function. As a matter of fact, if you use seq() and not seq.Date(), R recognizes that you are generating a sequence of dates and uses seq.Date() without problem. The from argument is the start date, the to argument the end date. If you use to, you also need to specify the increment in by (or, as shown further on, the length of the sequence). Here, you can use “day”, “week”, “month”, “quarter” or “year”. Note that “days”, “weeks”, “months”, “quarters” or “years” are also accepted. If you put an integer in front, e.g. “3 weeks”, R will increment with a multiple of “days”, “weeks”, … To illustrate, let’s create three vectors that all start on January 1, 2025 and end on December 31, 2025. The first increments in days, the second in steps of 3 weeks and the last in quarters:
start_d <- as.Date("2025-01-01")
end_d <- as.Date("2025-12-31")
vec_d <- seq.Date(from = start_d, to = end_d, by = "day")
vec_w <- seq.Date(from = start_d, to = end_d, by = "3 weeks")
vec_q <- seq.Date(from = start_d, to = end_d, by = "quarter")
R generates a sequence that ends at or before the date in to. To see this, let’s ask for the maximum value in each of these vectors:
max(vec_d)
[1] "2025-12-31"
max(vec_w)
[1] "2025-12-24"
max(vec_q)
[1] "2025-10-01"
If you increment with “day”, the last date is 2025-12-31. However, in both other cases, the last value of the sequence is before 2025-12-31. Using length.out, you determine the length of the sequence, but you let R determine the size of the increment if you include a value for to as the end point:
vec_d10 <- seq.Date(from = start_d, to = end_d, length.out = 10)
vec_d10
 [1] "2025-01-01" "2025-02-10" "2025-03-22" "2025-05-02" "2025-06-11"
 [6] "2025-07-22" "2025-08-31" "2025-10-11" "2025-11-20" "2025-12-31"
If you combine a value for both by and length.out, R will determine the end date. For instance, if you use 2025-01-01 as your start date and ask for 10 dates in steps of 1 week, R will produce:
vec_w10 <- seq.Date(from = start_d, by = "weeks", length.out = 10)
vec_w10
 [1] "2025-01-01" "2025-01-08" "2025-01-15" "2025-01-22" "2025-01-29"
 [6] "2025-02-05" "2025-02-12" "2025-02-19" "2025-02-26" "2025-03-05"
As you would with seq(), you can also use negative increments. In that case, R will count backwards in time. For instance, to generate a sequence starting on 2025-12-31 and ending at or before 2025-01-01 in steps of 5 weeks:
vec_db <- seq.Date(end_d, start_d, by = "-5 weeks")
vec_db
 [1] "2025-12-31" "2025-11-26" "2025-10-22" "2025-09-17" "2025-08-13"
 [6] "2025-07-09" "2025-06-04" "2025-04-30" "2025-03-26" "2025-02-19"
[11] "2025-01-15"
Using seq.POSIXt() you can generate a sequence of date/time values. As was the case with seq.Date(), you can enter a starting date/time in the from argument and an end date/time in the to argument, and supply the function with an increment “sec”, “min”, “hour”, “day”, “DSTday”, “week”, “month”, “quarter” or “year”. Adding an “s” will not cause an error: R knows that “day” is equal to “days”. In addition, you can put an integer in front to increment in multiples of these units, e.g. “6 hours”. The difference between “day” and “DSTday” has to do with daylight saving time: “day” adds exactly 24 hours, while “DSTday” takes daylight saving time into account and keeps the clock time the same. If you include from, to and length.out, R determines the increment. With from, by and length.out, R generates a sequence by adding the increment in by until the sequence has length.out elements. The sequence uses the time zone of from: as.POSIXct() uses the time zone of your system unless you specify another one in its tz argument. Here are a couple of examples:
start_d <- as.POSIXct("2025-01-01 12:00:00")
end_d <- as.POSIXct("2025-01-05 12:00:00")
vec_dt_hour <- seq.POSIXt(from = start_d, to = end_d, by = "6 hours")
vec_dt_10 <- seq.POSIXt(from = start_d, to = end_d, length.out = 10)
vec_dt_20 <- seq.POSIXt(from = start_d, by = "5 mins", length.out = 20)
We can now look at the results:
vec_dt_hour
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 18:00:00 CET"
[3] "2025-01-02 00:00:00 CET" "2025-01-02 06:00:00 CET"
[5] "2025-01-02 12:00:00 CET" "2025-01-02 18:00:00 CET"
[7] "2025-01-03 00:00:00 CET" "2025-01-03 06:00:00 CET"
[9] "2025-01-03 12:00:00 CET" "2025-01-03 18:00:00 CET"
[11] "2025-01-04 00:00:00 CET" "2025-01-04 06:00:00 CET"
[13] "2025-01-04 12:00:00 CET" "2025-01-04 18:00:00 CET"
[15] "2025-01-05 00:00:00 CET" "2025-01-05 06:00:00 CET"
[17] "2025-01-05 12:00:00 CET"
vec_dt_10
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 22:40:00 CET"
[3] "2025-01-02 09:20:00 CET" "2025-01-02 20:00:00 CET"
[5] "2025-01-03 06:40:00 CET" "2025-01-03 17:20:00 CET"
[7] "2025-01-04 04:00:00 CET" "2025-01-04 14:40:00 CET"
[9] "2025-01-05 01:20:00 CET" "2025-01-05 12:00:00 CET"
vec_dt_20
 [1] "2025-01-01 12:00:00 CET" "2025-01-01 12:05:00 CET"
[3] "2025-01-01 12:10:00 CET" "2025-01-01 12:15:00 CET"
[5] "2025-01-01 12:20:00 CET" "2025-01-01 12:25:00 CET"
[7] "2025-01-01 12:30:00 CET" "2025-01-01 12:35:00 CET"
[9] "2025-01-01 12:40:00 CET" "2025-01-01 12:45:00 CET"
[11] "2025-01-01 12:50:00 CET" "2025-01-01 12:55:00 CET"
[13] "2025-01-01 13:00:00 CET" "2025-01-01 13:05:00 CET"
[15] "2025-01-01 13:10:00 CET" "2025-01-01 13:15:00 CET"
[17] "2025-01-01 13:20:00 CET" "2025-01-01 13:25:00 CET"
[19] "2025-01-01 13:30:00 CET" "2025-01-01 13:35:00 CET"
As you can see from these examples, the way to use seq.POSIXt() is very similar to the way you use seq.Date() or seq().
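The output above shows CET because that was the time zone of the system on which the code was run. If you want a sequence that does not depend on the system, one option is to set the time zone when you create the start value. The sketch below assumes UTC; the names start_utc, vec_utc and vec_utc_txt are only used here:

```r
# Fix the time zone in the start value; the sequence inherits it
start_utc <- as.POSIXct("2025-01-01 12:00:00", tz = "UTC")
vec_utc <- seq.POSIXt(from = start_utc, by = "2 hours", length.out = 3)

# Show the date/time values as text, in UTC
vec_utc_txt <- format(vec_utc, "%Y-%m-%d %H:%M", tz = "UTC")
vec_utc_txt  # "2025-01-01 12:00" "2025-01-01 14:00" "2025-01-01 16:00"
```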
Generate a vector, vec_yt1, as a sequence:
vec_yt1 <- seq(from = 2, to = 12, by = 2)
vec_yt1 <- seq(from = 10, to = 0, by = -1)
vec_yt1 <- seq(from = 0, by = 5, length.out = 5)
vec_yt1 <- seq(from = 0, to = 14, by = 3)
vec_yt1 <- 5:50
Suppose you have a date 2025-03-25 and you need a sequence of 6 dates by week. Write the code to create this sequence and store it in a vector vec_ytd:
vec_ytd <- seq.Date(from = as.Date("2025-03-25"), by = "weeks", length.out = 6)
vec_ytd
[1] "2025-03-25" "2025-04-01" "2025-04-08" "2025-04-15" "2025-04-22"
[6] "2025-04-29"
class(vec_ytd)
[1] "Date"
Generate a vector, vec_yty, that starts at 2000-01-01 and ends at 2024-12-31 by year. Format the dates so that they only show the year (hint: use ?format) and use the pipe operator in your code.
vec_yty <- seq.Date(from = as.Date("2000-01-01", format = "%Y-%m-%d"), to = as.Date("2024-12-31", format = "%Y-%m-%d"), by = "year") |>
  format(format = "%Y")
We already covered statistical functions when we discussed numeric data. In that section, we showed how you can use pnorm(), dnorm(), qnorm() and rnorm(). However, with respect to the latter, rnorm(), we didn’t add much detail. The same holds for the other functions that generate random numbers, e.g. rt() for the t-distribution, runif() for the uniform distribution, rf() for the F-distribution or rchisq() for the Chi-square distribution. In simulations, these random number generators are widely used.
Before we move on to these random number generators, a few words about the way software generates these numbers. Random number generators are not truly “random”: they follow an algorithm to generate a sequence of numbers whose properties approximate those of a random sequence. The sequence is fully determined by the initial value that the algorithm starts from. This is why these generators are called pseudo random number generators. The initial value is called the seed. There are many pseudo random number generators, but the same generator will produce the same sequence of random numbers if the seed is the same. In R, you can select the pseudo random number generator. The default is “Mersenne-Twister”. You can see all other pseudo random number generators that are available if you use ?Random in the console.
Using set.seed(), you can make sure that R generates the same sequence of random numbers every time you ask R to generate a series. This function sets the initial value for the pseudo random number generator. Each time you use this value, you’ll get the same results. This is useful if you want to replicate your results. In addition, if you build a simulation, it is often useful to have the same sequence every time you add components to the simulation’s model.
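The effect of the seed can be checked with a short experiment. This is a minimal sketch; the seed value 123 is an arbitrary choice and the names draw_1, draw_2 and draw_3 are only used here:

```r
# Same seed, same sequence
set.seed(123)
draw_1 <- rnorm(3)

set.seed(123)
draw_2 <- rnorm(3)

# Without resetting the seed, the generator simply continues its sequence
draw_3 <- rnorm(3)

identical(draw_1, draw_2)  # TRUE
identical(draw_1, draw_3)  # FALSE
```

Note that the seed needs to be set again right before the code you want to reproduce. Every call to a random number generator moves the sequence forward.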
Every statistical distribution is characterized by its parameters. For the normal distribution, these are the mean and the standard deviation; for Student’s t-distribution as well as the Chi-square distribution, the parameter is the degrees of freedom; for the uniform distribution you need the minimum and the maximum; and for the F-distribution, the ratio of two independent Chi-square distributed variables that are each divided by their degrees of freedom, you need two degrees of freedom. If you supply these parameters, you can generate random numbers from these distributions:
set.seed(1000)
v_norm <- rnorm(n = 100, mean = 0, sd = 1)
v_t <- rt(n = 100, df = 5)
v_unif <- runif(n = 100, min = 0, max = 100)
v_chi <- rchisq(n = 100, df = 5)
v_f <- rf(n = 100, df1 = 10, df2 = 2)
With 100 random draws each, we can show the distribution of each of these 5 randomly generated vectors using base R’s hist() function and add an estimate of the density with lines():
hist(v_norm, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Normal")
lines(density(v_norm), lwd = 3, col = "darkgrey")
hist(v_t, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Student's t")
lines(density(v_t), lwd = 3, col = "darkgrey")
hist(v_unif, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Uniform")
lines(density(v_unif), lwd = 3, col = "darkgrey")
hist(v_chi, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "Chi squared")
lines(density(v_chi), lwd = 3, col = "darkgrey")
hist(v_f, probability = TRUE, col = "lightblue", border = "white", xlab = "Value", main = "F distribution")
lines(density(v_f), lwd = 3, col = "darkgrey")
A sample refers to a subset of values that are drawn at random from a vector. sample(x, size, replace = FALSE, prob = NULL) allows you to draw a random sample of size size from a vector x. By default, sampling is done without replacement. In other words, an element cannot appear twice in the sample unless it is included more than once in the vector x. In addition, all elements are equally likely to be drawn (prob = NULL). To illustrate this function, let’s use
vec_1 <- 1:48
and draw a sample, without replacement, of size = 10:
sample(x = vec_1, size = 10)
 [1] 16 43 44 48 35 33 25 47 17 12
If you draw a sample with replacement (replace = TRUE), each draw is returned to the vector and could be drawn again.
sample(x = vec_1, size = 10, replace = TRUE)
 [1] 26 10 47 10 37 42 46 27 23 33
Sampling is not limited to numeric vectors:
sample(x = c("a", "b", "c", "d", "e", "f", "g", "h"), size = 10, replace = TRUE)
 [1] "g" "d" "f" "e" "b" "c" "c" "g" "f" "d"
Without replacement, the sample size cannot be larger than the length of the vector. With replacement, there is no such limit. In the previous example for instance, the length of the vector was 8, while the size was 10. Without replacement, size = 8 would return all values of the vector in random order, and any size > 8 would not leave sufficient values to sample from. If some values need a higher probability of being drawn, you can add a vector with probability weights in the prob argument.
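Sampling with probability weights could look as follows. This is a minimal sketch; the vector colours, the weights and the seed are made up for this illustration:

```r
set.seed(2025)  # arbitrary seed, only to make the draws reproducible

colours <- c("red", "green", "blue")

# One weight per element of x: "red" is drawn most often, "blue" least often
draws <- sample(x = colours, size = 1000, replace = TRUE,
                prob = c(0.7, 0.2, 0.1))

# Share of each colour in the sample
table(draws) / length(draws)
```

With 1000 draws, the shares in the sample should be close to the weights 0.7, 0.2 and 0.1, but they will usually not match them exactly.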
As a special case, if you only include the vector x, R returns a random permutation of the vector’s values:
x <- 1:10
sample(x)
 [1] 6 2 9 5 7 1 3 10 8 4
Using set operations, you can determine whether an element in one vector is also an element in another, find the elements for which that is not the case, or merge the elements of both into one new vector.
Suppose you have two vectors,
vec_1 <- c(10, 20, 30, 40)
vec_2 <- c(20, 30, 40, 40)
and you want to know if both share common elements. There are various ways to check if that is the case. The first uses the intersect() function. This function has two arguments: the vectors you want to compare. Note that if you load {dplyr}, the package masks this function. To instruct R to use base R’s intersect(), you need to add base:: in front. The same holds for some other functions in this section.
base::intersect(vec_1, vec_2)
[1] 20 30 40
The output shows the values that these two vectors have in common. If you want to store these values, you assign them to a new vector. Note that this also allows you to see how many values both vectors have in common. Using the length() function, you can verify how many (unique) values are common to both vectors:
length(base::intersect(vec_1, vec_2))
[1] 3
Here, we used numeric values, but you can find common strings in character vectors in a similar way:
friends <- c("Monica", "Phoebe", "Joey", "Chandler", "Ross", "Rachel")
collegues <- c("Taylor", "David", "Joey", "Sandra")
base::intersect(friends, collegues)
[1] "Joey"
This example also shows that the vectors don’t have to have the same length. If there are no common values, R will output an empty vector:
base::intersect(c(10, 20), c(50, 60))
numeric(0)
is.element(x, y) allows you to determine if elements of one vector, x, are included in the other, y. The outcome is a boolean vector whose values are TRUE if an element from x occurs in y and FALSE otherwise.
is.element(vec_1, vec_2)
[1] FALSE TRUE TRUE TRUE
The values in the last three columns in vec_1 are also included in vec_2. Using the %in% operator has the same outcome, as it checks which values in its left hand side vector are included in its right hand side vector:
vec_1 %in% vec_2
[1] FALSE TRUE TRUE TRUE
You can also use this result to see how many elements from the first vector are also in the second. Here, you use the fact that TRUE is also 1 and FALSE is 0:
sum(is.element(vec_1, vec_2))[1] 3
Note that the order of the vectors matters. If you use is.element(x, y) you check if the elements from x are included in y. With is.element(y, x) you determine the elements in y that are also in x. In the example, you can see that changing the order in the is.element() function shows a different output as 40 is includes in vec_2 twice, but is only once included in vec_1
vec_1[is.element(vec_1, vec_2)][1] 20 30 40
vec_2[is.element(vec_2, vec_1)][1] 20 30 40 40
Recall that using the “!” you can check if a condition is not met. Here, you can use this to see which elements of x are not in y
!is.element(vec_1, vec_2)[1] TRUE FALSE FALSE FALSE
base::setdiff(x, y) allows you to look for elements that are different, in other words, which elements from x are not included in y. While the output of !is.element(x, y) is a boolean vector, base::setdiff() shows the values of x that are not included in y.
base::setdiff(vec_1, vec_2)[1] 10
Note again that the order of the vectors matters.
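To make the role of the order concrete, here is a minimal sketch. The vectors x_demo and y_demo are illustrative values chosen for this example; they are not the chapter's vec_1 and vec_2.

```r
# Illustrative vectors (assumed values, not the chapter's vec_1 and vec_2)
x_demo <- c(10, 20, 30, 40)
y_demo <- c(20, 30, 40, 40, 50)

# Elements of x_demo that are not in y_demo
diff_xy <- base::setdiff(x_demo, y_demo)
diff_xy
# [1] 10

# Elements of y_demo that are not in x_demo
diff_yx <- base::setdiff(y_demo, x_demo)
diff_yx
# [1] 50
```

Swapping the arguments changes the question that is asked, so the two results generally differ.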
To create a union of x and y, there is the base::union(x, y) function. This function shows the unique values after merging the values in x and y:
base::union(vec_1, vec_2)[1] 10 20 30 40
If you want to know the positions in vec_1 of the elements it has in common with vec_2, you can use the which() function:
which(is.element(vec_1, vec_2))[1] 2 3 4
The unique(x, incomparables = FALSE) function determines the unique values in a vector. Suppose that you have a vector
vec_char <- c("jan", "jan", "feb", "mar", "mar", "apr")
This vector has 4 unique values: “jan”, “feb”, “mar” and “apr”. Using the unique() function, you can select the unique values:
unique(x = vec_char)[1] "jan" "feb" "mar" "apr"
If you want to exclude a value from this de-duplication, you can add it to the incomparables argument. For instance, suppose that you want to reduce all values to their unique occurrences, except January. To do so, you can add incomparables = c("jan"):
unique(vec_char, incomparables = c("jan"))[1] "jan" "jan" "feb" "mar" "apr"
R will now show all occurrences of “jan” as well as the unique values of all others.
Generate a vector, vec_rn with 20 draws from a normal distribution with mean 5 and standard deviation 10
vec_rn <- rnorm(20, mean = 5, sd = 10)
Generate a vector, vec_ru, with 20 draws from a uniform distribution with minimum 5 and maximum 10. Write this code without naming the arguments.
vec_ru <- runif(20, 5, 10)
Using vec_rn, draw a sample of 6 observations with replacement and assign these to a vector vec_rns.
vec_rns <- sample(vec_rn, size = 6, replace = TRUE)
A lottery includes a weekly draw of 6 numbers, without replacement, from a bowl with all numbers from 1 to 40. To play, you buy a ticket with 6 numbers, from 1 to 40. You win something if at least two numbers on your ticket are drawn. Your numbers are 3, 9, 25, 36, 37, 39. Simulate this lottery. To do so, first sample the weekly draw. Second, determine how many of your numbers match the numbers of the draw. Use 3 ways to calculate the number of winning numbers.
draw <- sample(1:40, 6, replace = FALSE)
ticket <- c(3, 9, 25, 36, 37, 39)
# Option 1: use intersect
win <- length(intersect(draw, ticket))
win[1] 1
# Option 2: use is.element
win <- sum(is.element(ticket, draw))
win[1] 1
# Option 3: use %in%
win <- sum(ticket %in% draw)
win[1] 1
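Because sample() draws at random, the number of matches will differ from run to run. The sketch below assumes you want a reproducible draw and uses set.seed() for that; the seed value 123 is arbitrary and the names ending in _demo are illustrative.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration
ticket_demo <- c(3, 9, 25, 36, 37, 39)
draw_demo <- sample(1:40, 6, replace = FALSE)

# Number of matching numbers and whether the ticket wins (at least two matches)
matches_demo <- sum(ticket_demo %in% draw_demo)
wins_demo <- matches_demo >= 2

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
draw_again <- sample(1:40, 6, replace = FALSE)
identical(draw_demo, draw_again)
# [1] TRUE
```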
R includes a number of special vectors. For instance, the vectors “letters” and “LETTERS” include the letters of the alphabet, the first in lowercase and the second in uppercase:
letters [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
LETTERS [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
In addition to letters, the vectors “month.abb” and “month.name” include the names of the months:
month.abb [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name [1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
If you subset a vector, you select one or more columns of that vector (to possibly store them in a new one). We first start with the general case: an unnamed vector. We then continue with the special case of a named vector. Note that all methods for an unnamed vector can be used for named vectors.
Here, we will use the numeric vector vec_num:
vec_num <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
Note that you can apply most of these ways to subset to other vector types: character, date/time or logical vectors. The approach doesn’t differ for any of these other vector types. There are two subset operators: [] and [[ ]].
To access an individual element of a vector, you include its position (or index number) between square brackets of the subscript operator [] after the name of the vector. In R, vector indexing starts at 1. In other words, the first element of a 1x10 vector is at position 1, the second at position 2, … This is not always the case. In Python for instance, the first element of a vector is at position 0, the second at position 1, …
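A short sketch of what 1-based indexing means in practice, using an illustrative vector idx_demo (a hypothetical name, holding the same values as vec_num). It also shows what R returns for position 0 and for a position beyond the last element.

```r
# Illustrative vector (hypothetical name)
idx_demo <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

idx_demo[1]    # the first element is at position 1
# [1] 0
idx_demo[0]    # position 0 does not exist: the result is empty
# numeric(0)
idx_demo[11]   # beyond the last element: a missing value
# [1] NA
```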
Let’s look at the element in the fifth column of vec_num:
vec_num[5][1] 3
If you want to extract that element to use it in part of your code, you would assign it to a different vector using the <- operator:
a <- vec_num[5]
a[1] 3
Note that subsetting leaves the original vector intact. If you subset a vector, you copy the value in a new vector, but that value stays in the original vector.
You can subset more than one column. Suppose that you want to subset columns 1 to 4. To do so, you can use 1:4 within the subscript operator:
vec_num[1:4][1] 0 1 1 2
Again, you could assign this new vector. Here, this new vector would have 1 row and 4 columns. These 4 columns would be equal to the first 4 columns of the original vector.
The third way to access elements in a vector using their position is to combine these positions via the c() function within the subsetting operator. The c() function allows you to define the columns you need. The subscript operator will then access these columns and extract their values. Suppose that you want to extract the elements in columns 1 and 4. Note that here, you will extract two columns: 1 and 4. In the previous example you extracted 4 columns: 1 to 4, or columns 1, 2, 3 and 4. To extract columns 1 and 4, you need to include those positions in the c() function, c(1, 4), and use:
vec_num[c(1, 4)][1] 0 2
Note that you can mix various ways to subset a vector. For instance, if you need the first to third, the fifth and the seventh to last elements, you can combine the various ways of subsetting the vector:
vec_num[c(1:3, 5, 7:10)][1] 0 1 1 3 8 13 21 34
You can also use negative numbers for the index elements. In that case, R will show all elements, except those in the negative index (negative index range). For instance,
vec_num[-1][1] 1 1 2 3 5 8 13 21 34
vec_num[-1:-4][1] 3 5 8 13 21 34
vec_num[(c(-2, -4))][1] 0 1 3 5 8 13 21 34
vec_num[-c(2, 4)][1] 0 1 3 5 8 13 21 34
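One restriction is worth knowing: positive and negative positions cannot be combined in the same index. The sketch below uses an illustrative vector mix_demo and wraps the call in tryCatch() so that the error is captured instead of stopping the script; the message text in the handler is my own wording, not R's.

```r
mix_demo <- c(0, 1, 1, 2, 3)

# tryCatch() captures the error so the script keeps running
mix_result <- tryCatch(
  mix_demo[c(-1, 2)],
  error = function(e) "error: cannot mix positive and negative indices"
)
mix_result
# [1] "error: cannot mix positive and negative indices"
```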
The fourth way to subset columns in a vector uses a logical vector of the same length as the vector to subset. To see how this works, let’s first define two vectors: one numeric and one logical:
vec_1 <- c(1, 2, 3, 4, 5)
vec_log <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
You can now subset vec_1 using vec_log:
vec_1[vec_log][1] 1 5
If the value on position x in vec_log is “TRUE”, the result of vec_1[vec_log] is equal to the value in the xth column of vec_1. This is the case for the first and last value. If vec_log’s yth element is false, vec_1’s yth element is not extracted.
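If the logical vector is shorter than the vector you subset, R recycles it until it has the required length. A minimal sketch with an illustrative vector rec_demo:

```r
rec_demo <- c(1, 2, 3, 4, 5, 6)

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE FALSE
every_other <- rec_demo[c(TRUE, FALSE)]
every_other
# [1] 1 3 5
```

This is convenient for regular patterns, but it can hide mistakes, so a logical vector of the same length is usually the safer choice.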
In the example, we defined the logical vector ourselves. However, there are many other ways to create such a vector. Recall that the outcome of any boolean operation is either TRUE or FALSE. Applying a boolean operation to every column of a vector creates a logical vector of the same length as the vector to which the operation was applied. You can then select the columns that meet that condition. For instance, suppose you want to work with the elements of vec_num that are larger than 5. There are two ways to do so. First, you create a logical vector of the same length as vec_num where an element is TRUE if the element in vec_num in the same position meets the condition and FALSE otherwise. To create that vector, you use logical vector <- original vector + condition. As we will see shortly, boolean operators applied to a vector are applied to every element of that vector. In other words, the logical vector will have the same length as the vector whose elements you want to extract.
cond <- vec_num > 5
cond [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Now we have a logical vector cond whose values are TRUE if the element in the same position in vec_num meets the condition (> 5) and FALSE otherwise. We can now use this vector to subset vec_num:
vec_num[cond][1] 8 13 21 34
Here, the TRUE-FALSE elements of cond are used to subset vec_num. If an element in cond is TRUE, vec_num[cond] extracts the element in the same position from vec_num. If the element in cond is FALSE, the element in the same position in vec_num is not extracted.
The second option to use a condition is shorter and uses the condition within the subscript operator:
vec_num[vec_num > 5][1] 8 13 21 34
Note that you can use more than one boolean operator. For instance extracting all elements larger than 3 and not equal to 13 can be done using:
cond <- vec_num > 3 & !(vec_num == 13)
vec_num[cond][1] 5 8 21 34
or
vec_num[vec_num > 3 & !(vec_num == 13)][1] 5 8 21 34
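Note that you can also use these conditions with character vectors. For instance, to see if “cat” and “dog” are values in vec_char (which currently holds the month abbreviations defined earlier, so nothing matches and R returns an empty character vector):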
vec_char[vec_char == "dog" | vec_char == "cat"]character(0)
If you don’t know the exact location and you don’t have an explicit condition that you can use, but you know which values you want to extract, you can use the %in% operator. Here, you first define a vector with values, e.g. 1, 8 and 143 using c(1, 8, 143). Using the %in% operator, you can now check which elements of vec_num occur in that vector:
vec_num %in% c(1, 8, 143) [1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
This code checks which elements of vec_num are also in c(1, 8, 143). The result of this operation is a logical vector with the same length as vec_num, whose elements equal TRUE if the element in vec_num is included in c(1, 8, 143) and FALSE otherwise. This vector now allows you to extract these elements using, e.g.
vec_num[vec_num %in% c(1, 8, 143)][1] 1 1 8
As an alternative, you can do so in two steps:
cond <- vec_num %in% c(1, 8, 143)
vec_num[cond][1] 1 1 8
Here we included the elements to extract in a c() function. However, you can use any other vector.
a <- c(1, 8, 143)
vec_num[vec_num %in% a][1] 1 1 8
Note that you can include values in %in% c() which are not part of the vector. If this is the case, R won’t find them. Hence, adding them wouldn’t change the outcome. If no element in the original vector matches the condition, R will output numeric(0) or a vector of length 0.
To illustrate this and the use of %in% for a character vector, suppose you want to extract “dog” and “fish” from vec_char:
vec_char <- c("dog", "cat", "rabbit")
vec_char[vec_char %in% c("dog", "fish")][1] "dog"
As “fish” is not included in vec_char but “dog” is, and “rabbit” is part of vec_char but isn’t in c("dog", "fish"), R outputs “dog”.
Using is.element(), you can subset a vector (the first) and extract all values that are in both the first and the second vector. To illustrate:
vec_1 <- c(1:10)
vec_2 <- c(7:15)
vec_1[is.element(vec_1, vec_2)][1] 7 8 9 10
Boolean operators allow you to define many conditions. For instance, if you have a vector that includes missing values, you can extract all non-missing values using !is.na(), or is not (!) NA:
vec_1 <- c(1, 20, 20, NA, NA, 50)
vec_1[!is.na(vec_1)][1] 1 20 20 50
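The reason for using is.na() rather than a comparison with NA is that any comparison with NA is itself NA. A short sketch with an illustrative vector na_demo:

```r
na_demo <- c(1, 20, 20, NA, NA, 50)

# Comparing with NA yields NA for every element, not TRUE or FALSE
eq_na <- na_demo == NA
eq_na
# [1] NA NA NA NA NA NA

# is.na() returns a proper logical vector
is_missing <- is.na(na_demo)
is_missing
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE
```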
As a second example, suppose that you want to extract all even numbers. Recall that a number is even if the remainder after division by 2 is zero:
10 %% 2[1] 0
11 %% 2[1] 1
You can use this to create a condition
10 %% 2 == 0[1] TRUE
11 %% 2 == 0[1] FALSE
that you can use to subset elements of a vector:
cond <- vec_num %% 2 == 0
vec_num[cond][1] 0 2 8 34
As a third example, recall that you can use grepl() or stringr::str_detect() to check if a pattern occurs in a string. If that is the case, these functions output TRUE. Suppose you have a character vector
vec_char <- c("sales_shoes", "sales_trousers", "sales_shirts", "sales_jackets")
and you want to extract the column which includes “shoes”. Using grepl() you can identify the elements in the character vector that include the word “shoes”:
grepl(pattern = "shoes", x = vec_char)[1] TRUE FALSE FALSE FALSE
You can use this function to extract the element “sales_shoes” from vec_char.
vec_char[grepl(pattern = "shoes", x = vec_char)][1] "sales_shoes"
In all examples we either used an exact index position or a logical vector to extract the values of a vector. What if you are not interested in a value but in an index position? To show an index position rather than its value or TRUE or FALSE, you can use the which() function. For instance, suppose you want to know the position of value 1 in vec_num. To find this position, you can use
which(vec_num == 1)[1] 2 3
The result shows the index positions where you can find 1 in vec_num. Note that you can save the output in a new vector with positions. You can then subset that vector to find the first occurrence. As an alternative, as which() outputs a vector, you can find the first occurrence by subsetting the output of which(). For instance, to find the first 1 in vec_num:
which(vec_num == 1)[1][1] 2
What if you want to find multiple values? Here, you can use the %in% operator. Suppose you want to know the position of the values 1, 2, 8 and 55. First you collect these values in a vector using c(1, 2, 8, 55). You can now use that vector in the which() function:
which(vec_num %in% c(1, 2, 8, 55))[1] 2 3 4 7
which() shows every occurrence. Using match() you can find the first occurrence. For instance, the first occurrence of “1” in vec_num is in position
match(1, vec_num)[1] 2
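match() also accepts several values at once. The sketch below uses an illustrative vector fib_demo (a hypothetical name, holding the same values as vec_num) and shows that a value that does not occur yields NA:

```r
fib_demo <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

# First position of each value; 55 does not occur in fib_demo
first_pos <- match(c(1, 2, 8, 55), fib_demo)
first_pos
# [1]  2  4  7 NA
```

Unlike which() combined with %in%, the result has one entry per value you searched for, in the order in which you listed them.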
Using which() allows you to extract the positions of e.g. missing values. Suppose you have a vector vec_1 which includes missing values:
vec_1 <- c(10, 10, 20, NA, 30, 40, NA, 50, 50)
To locate these missing values, you can use
which(is.na(vec_1))
[1] 4 7
There are two variants of the which() function that allow you to find the location of the (first) maximum or minimum value: which.max() and which.min():
which.max(vec_1)[1] 8
which.min(vec_1)[1] 1
Using logical values, you can find the first occurrence of a value that meets a condition. Here, which.max() uses the fact that TRUE = 1 and FALSE = 0. In other words, this function will show the position of the first TRUE:
which.max(vec_1 > 30)[1] 6
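One caveat with this trick: if no element meets the condition, all values are FALSE (0), the maximum is 0, and which.max() returns position 1. The sketch below, with illustrative names, guards against that using any():

```r
wm_demo <- c(10, 20, 30)
no_match <- wm_demo > 100   # all FALSE

which.max(no_match)         # still returns 1, although nothing matches
# [1] 1

# Guard with any() before trusting the position
first_hit <- if (any(no_match)) which.max(no_match) else NA_integer_
first_hit
# [1] NA
```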
With a named vector, you can also use the column names to subset. Suppose that you have a vector
vec_1 <- c(A = 10, B = 30, C = 50, D = 70)
First, you can use the ways you would use to subset an unnamed vector, e.g.
vec_1[3]
vec_1[2:4]
vec_1[vec_1 < 50]
As you can see, using [] preserves the structure of the vector: the output shows both the column name and its value.
You can also use the name of the column to subset the vector using vec_1["column name"]:
vec_1["A"] A
10
The output shows both the column name as well as the value. In other words, here too, the structure of the vector is preserved.
To subset more than one column, you can use
vec_1[c("A", "D")] A D
10 70
To extract only the value, you have to refer to the column using the subsetting operator [[]]. You can do so using either the column name or the column number. These lines extract the value of the second column:
vec_1[["B"]][1] 30
vec_1[[2]][1] 30
The output shows the value without the column. The [[]] operator simplifies the structure of the vector: it returns the simplest possible data structure: here this is the value of the column, i.e. an unnamed vector.
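You can verify the difference between the two operators by inspecting the names of the results. A minimal sketch with an illustrative named vector named_demo:

```r
named_demo <- c(A = 10, B = 30, C = 50, D = 70)

kept <- named_demo["B"]       # [] keeps the name
dropped <- named_demo[["B"]]  # [[]] drops the name

names(kept)
# [1] "B"
names(dropped)
# NULL
```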
You can also subset columns whose name includes a pattern. Recall that names() allows you to extract the names of the columns in a named vector. Using grepl() you can check if these names include a pattern. For instance, let’s check if the names of vec_1 include “A”. Using grepl():
grepl(pattern = "A", x = names(vec_1))[1] TRUE FALSE FALSE FALSE
To extract that column, you include that statement in vec_1[]:
vec_1[grepl(pattern = "A", x = names(vec_1))] A
10
The result shows the name of the column and its value.
Generate a vector:
vec <- c(21:30)
Extract the following elements from this vector:
vec[5][1] 25
vec[1:5][1] 21 22 23 24 25
vec[c(1, 3, 9)][1] 21 23 29
vec[-c(1, 3, 9)][1] 22 24 25 26 27 28 30
vec[vec > 25][1] 26 27 28 29 30
Use this vector
vecchar <- c("dog", "fish", "cat", "bird", "duck", "rabbit")
to extract all animals whose name includes an “a”
vecchar[grepl(pattern = "a", vecchar)][1] "cat" "rabbit"
As in the previous section, I’ll use a numeric vector here, but you can apply the rules also to other types of vectors. Suppose that you have the 1x10 vector vec_num and you want to add a column with the value 55. The first way to do so is to use the c() function to create a new vector
c(vec_num, 55) [1] 0 1 1 2 3 5 8 13 21 34 55
In this way you can add multiple columns and/or multiple vectors:
c(vec_num, c(55, 89, 144), c(233, 377, 610)) [1] 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
c() adds all elements in the order in which they appear in the function:
c(c(610, 377, 233), c(144, 89, 55), vec_num) [1] 610 377 233 144 89 55 0 1 1 2 3 5 8 13 21 34
Note that this doesn’t change vec_num: c() creates a new vector. If you want to change vec_num, you have to reassign it to the new vector. As an alternative, you can assign the new vector to a new object:
vec_1 <- c(vec_num, c(55, 89, 144), c(233, 377, 610))
vec_1 [1] 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
If you have a named vector, you can add a new named vector:
vec_1 <- c(A = 10, B = 30, C = 50, D = 70)
c(vec_1, c(E = 90)) A B C D E
10 30 50 70 90
You can also use the append() function to add new elements. By default, append() adds the new elements after the last element of the existing vector. In other words, by default, append() is similar to c(). However, the after argument in append(x, values, after = length(x)) allows you to change that default position. If you want to add the new element after position 3, you change the default length(x) to 3. Note that append() doesn’t change the original vector:
append(vec_num, 55) [1] 0 1 1 2 3 5 8 13 21 34 55
vec_num [1] 0 1 1 2 3 5 8 13 21 34
If you want to change the original vector, you have to reassign it to its new values or assign the outcome to a new object:
vec_1 <- append(vec_num, 144)
vec_1 [1] 0 1 1 2 3 5 8 13 21 34 144
To add the value 88 as the first element or 143 after column 9, you can change the default location in append()’s after = argument:
append(vec_num, 88, after = 0) [1] 88 0 1 1 2 3 5 8 13 21 34
append(vec_num, 143, after = 9) [1] 0 1 1 2 3 5 8 13 21 143 34
Using the c() function, you can add multiple elements. For instance, if you want to add 88 and 143 as the first two columns of vec_num you combine these two values within c() and include them in the append statement:
append(vec_num, c(88, 143), after = 0) [1] 88 143 0 1 1 2 3 5 8 13 21 34
Note that you can change the position where these new values are added. However, all elements are added after the same position and their order follows their order within the c() function. Note also that, if you add an element whose type is different from the vector type, R will change the vector type.
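A short sketch of that type change, using illustrative vectors num_demo and mixed_demo: appending a character value to a numeric vector turns every element into a character string.

```r
num_demo <- c(1, 2, 3)
typeof(num_demo)
# [1] "double"

# Appending a character value coerces the whole vector to character
mixed_demo <- append(num_demo, "4")
mixed_demo
# [1] "1" "2" "3" "4"
typeof(mixed_demo)
# [1] "character"
```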
You can also add a named vector
append(vec_1, c(E = 50), after = 0) E
50 0 1 1 2 3 5 8 13 21 34 144
There are multiple ways to remove elements from a vector. We already covered two. First, if you know the position of the elements you want to remove, you can use a negative index. Recall that a negative index allows you to extract the elements of a vector except those included in the negative index. For instance, if you want to remove the first 4 columns of vec_num you can do this using
vec_num[-1:-4][1] 3 5 8 13 21 34
To remove the first and the last column (but none of the columns in between):
vec_num[-c(1, length(vec_num))][1] 1 1 2 3 5 8 13 21
or
vec_num[c(-1, -length(vec_num))][1] 1 1 2 3 5 8 13 21
You can use this approach if you know the exact location (i.e. the columns) you want to remove.
The second way to remove elements uses a condition. For instance, the code to remove all elements that are larger than 3 or equal to 0 (in other words, to keep the elements that are not larger than 3 and not equal to 0) is
vec_num[!vec_num > 3 & !(vec_num == 0)][1] 1 1 2 3
or, using a specific vector including the condition:
cond <- !vec_num > 3 & !(vec_num == 0)
vec_num[cond][1] 1 1 2 3
You can use this approach if you know the condition that elements need to meet.
If you want to remove known values from a vector, e.g. 1, 8 and 143, you can use an approach which is very similar to the one you used to subset these elements. First, you collect them in a vector c(1, 8, 143). Second, you use %in% and not (!) to remove these elements:
cond <- vec_num %in% c(1, 8, 143)
vec_num[!cond][1] 0 2 3 5 13 21 34
or, in one line of code
vec_num[!vec_num %in% c(1, 8, 143)][1] 0 2 3 5 13 21 34
In the last statement, 143 was included in the vector with values to remove but is not in vec_num. R doesn’t check if all values to be removed are also in the vector where they need to be removed.
Suppose that you know which column you want to change in your vector, e.g. you want to change the value in the 4th column. To do this, you first subset that element using vec_num[4] and then reassign its value. For instance, changing the fourth element to 250:
vec_num[4] <- 250
vec_num [1] 0 1 1 250 3 5 8 13 21 34
As you can see, the fourth element is now 250. Note that the new value needs to be of the same type as the vector. If that is not the case, you’ll change the type of all other elements in the vector. For instance
vec_num[4] <- "250"
changes the type of the vector from double to character:
typeof(vec_num)[1] "character"
In that case, you have to change the vector’s type:
vec_num <- as.numeric(vec_num)
Using replace() you can change many values in a vector. Suppose you want to change columns 1, 8 and 10 to 50, 100 and 150. The first argument in the replace() function is the vector you want to change. Here, this is vec_num. The second argument is a vector with index positions. Using c(1, 8, 10) you can fix these positions. The last argument is a vector with the values that will be used to replace the values in the index positions. Here you would use c(50, 100, 150). Using these in the replace() function:
replace(vec_num, c(1, 8, 10), c(50, 100, 150)) [1] 50 1 1 250 3 5 8 100 21 150
Note that the length of the index vector and the length of the vector with new values should be equal. If this is not the case, R will show a warning. In the example below, the surplus value 200 is not used:
replace(vec_num, c(1, 8, 10), c(50, 100, 150, 200))
Warning in x[list] <- values: number of items to replace is not a multiple of replacement length
[1] 50 1 1 250 3 5 8 100 21 150
If you want to replace all values that meet a certain condition with one single value, you can use the replace() function as well. Suppose you want to replace all values larger than 25 with 50. Using replace() you could do this with:
replace(vec_num, vec_num > 25, 50) [1] 0 1 1 50 3 5 8 13 21 50
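replace() returns a new vector and leaves its input untouched. If you want to change the vector itself, you can assign to a logical subset instead. The sketch below uses an illustrative copy rep_demo, so that vec_num is not modified:

```r
rep_demo <- c(0, 1, 1, 250, 3, 5, 8, 13, 21, 34)

# replace() returns a modified copy; rep_demo is unchanged at this point
capped <- replace(rep_demo, rep_demo > 25, 50)

# Assigning to a logical subset changes rep_demo itself
rep_demo[rep_demo > 25] <- 50

identical(capped, rep_demo)
# [1] TRUE
```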
Changing the vector’s type is another way to change a vector. Suppose you have a vector
vec_dat_char <- c("01-01-2025", "02-01-2025", "03-01-2025")
This vector is a character vector:
typeof(vec_dat_char)[1] "character"
You can change this type to Date or POSIX using as.Date() or as.POSIXct(). Using the first:
as.Date(vec_dat_char, format = "%d-%m-%Y")[1] "2025-01-01" "2025-01-02" "2025-01-03"
In a similar way, you can change the type of numeric variables to character, dates to numeric, and so on.
To sort a vector, R includes the sort(x, decreasing = FALSE, na.last = NA) function. Here, x is the vector to sort. By default, R sorts in increasing order. The last argument determines the treatment of “NA” values. By default (na.last = NA), they are removed. With na.last = TRUE, missing values are retained and put last; with na.last = FALSE, they are put first.
sort(x = vec_num, decreasing = FALSE) [1] 0 1 1 3 5 8 13 21 34 250
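As vec_num has no missing values, the effect of na.last is not visible in this output. A minimal sketch with an illustrative vector sort_demo that does include an NA:

```r
sort_demo <- c(3, NA, 1, 2)

sort(sort_demo)                    # default na.last = NA: NA is removed
# [1] 1 2 3
sort(sort_demo, na.last = TRUE)    # NA is kept and placed last
# [1]  1  2  3 NA
sort(sort_demo, na.last = FALSE)   # NA is kept and placed first
# [1] NA  1  2  3
```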
Character vectors are sorted alphabetically by default:
sort(x = c("zoo", "Zoo", "coast", "coAst", "cOAst", "lake"))[1] "coast" "coAst" "cOAst" "lake" "zoo" "Zoo"
As you can see, if the strings include copies that differ only in case, R orders those with the lowest number of uppercase letters first. Note that the sort order of character vectors depends on the locale of your system, so the result can differ on another machine.
Generate a vector:
vec <- c(21:30)
Change this vector:
Add c(31, 32, 33, 34, 34) after the last position in vec. Use two methods to do so. Store the results in vec_r:
# Option 1: use c()
vec_r <- c(vec, c(31, 32, 33, 34, 34))
vec_r [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 34
# Option 2: use append()
vec_r <- append(vec, c(31, 32, 33, 34, 34))
vec_r [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 34
Add c(31, 32, 33, 34) as the first elements of vec. Use two methods to do so. Store the results in vec_r:
# Option 1: use c()
vec_r <- c(c(31, 32, 33, 34), vec)
vec_r [1] 31 32 33 34 21 22 23 24 25 26 27 28 29 30
# Option 2: use append() and add position
vec_r <- append(vec, c(31, 32, 33, 34), after = 0)
vec_r [1] 31 32 33 34 21 22 23 24 25 26 27 28 29 30
Add c(31, 32, 33, 34, 34) after the fifth element of vec. Store the results in vec_r:
vec_r <- append(vec, c(31, 32, 33, 34, 34), after = 5)
vec_r [1] 21 22 23 24 25 31 32 33 34 34 26 27 28 29 30
Using vec_r you created in the last exercise:
vec_r <- vec_r[-6:-10]
vec_r [1] 21 22 23 24 25 26 27 28 29 30
vec_r[5] <- 250
vec_r <- replace(vec_r, c(1, 2, 3), c(210, 220, 230))
vec_r [1] 210 220 230 24 250 26 27 28 29 30
replace(vec_r, vec_r < 100, 100) [1] 210 220 230 100 250 100 100 100 100 100
Using vec, sort this vector in decreasing and increasing order.
sort(vec, decreasing = TRUE) [1] 30 29 28 27 26 25 24 23 22 21
sort(vec) [1] 21 22 23 24 25 26 27 28 29 30
Many operations in R are vectorized. This means that an operator works on a vector’s individual elements. For functions, that means that R, for most of them, applies them to every element of that vector.
We introduced mathematical operators and functions, statistical functions and e.g. rounding in the previous chapter. Almost all of these are vectorized. All operators and functions generate output. If you want to store these results, you have to assign them to a new object. Here, this object is usually a vector. In the examples, this assignment is left out to keep the code short.
Let’s first create two vectors, vec_num1 and vec_num2
vec_num1 <- c(10, 10, 20, 30, 50, 80, 130, 210, 340, 550)
vec_num2 <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)
If you add a numeric value to a vector, subtract it from a vector, or multiply or divide a numeric vector by a numeric value, R applies this operation to every element of the vector. For instance
vec_num1 + 100 [1] 110 110 120 130 150 180 230 310 440 650
vec_num1 - 100 [1] -90 -90 -80 -70 -50 -20 30 110 240 450
vec_num1 * 10 [1] 100 100 200 300 500 800 1300 2100 3400 5500
vec_num1 / 25 [1] 0.4 0.4 0.8 1.2 2.0 3.2 5.2 8.4 13.6 22.0
vec_num1 %/% 3 [1] 3 3 6 10 16 26 43 70 113 183
vec_num1 %% 3 [1] 1 1 2 0 2 2 1 0 1 1
Applied to two vectors of the same length, R adds, subtracts, multiplies or divides each element in one vector to/from/with/by the corresponding element in the other vector:
vec_num1 + vec_num2 [1] 11 11 22 33 55 88 143 231 374 605
vec_num1 - vec_num2 [1] 9 9 18 27 45 72 117 189 306 495
vec_num1 * vec_num2 [1] 10 10 40 90 250 640 1690 4410 11560 30250
vec_num1 / vec_num2 [1] 10 10 10 10 10 10 10 10 10 10
vec_num1 %/% vec_num2 [1] 10 10 10 10 10 10 10 10 10 10
vec_num1 %% vec_num2 [1] 0 0 0 0 0 0 0 0 0 0
Note that this saves a lot of work. Without vectorization, to add two vectors, you would have to write some code, e.g.:
if (length(vec_num1) != length(vec_num2)) {
  print("Cannot add vectors of a different length")
} else {
  # Pre-allocate the result, then add element by element
  vec_num4 <- vector("numeric", length = length(vec_num1))
  for (i in seq_along(vec_num1)) {
    vec_num4[i] <- vec_num1[i] + vec_num2[i]
  }
}
vec_num4 [1] 11 11 22 33 55 88 143 231 374 605
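The examples above use vectors of the same length. If the lengths differ, R does not stop: it recycles the shorter vector until it matches the longer one, and it warns only when the longer length is not a multiple of the shorter. A minimal sketch with illustrative vectors long_demo and short_demo:

```r
long_demo <- c(1, 2, 3, 4)
short_demo <- c(10, 20)

# short_demo is recycled to 10, 20, 10, 20 before the addition
recycled_sum <- long_demo + short_demo
recycled_sum
# [1] 11 22 13 24
```

Adding a single number to a vector, as in vec_num1 + 100, is the same mechanism with a vector of length 1.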
For functions, let’s illustrate vectorisation using vec_num1. All functions were introduced in previous sections.
abs(-vec_num1) [1] 10 10 20 30 50 80 130 210 340 550
log(vec_num1) [1] 2.302585 2.302585 2.995732 3.401197 3.912023 4.382027 4.867534 5.347108
[9] 5.828946 6.309918
log10(vec_num1) [1] 1.000000 1.000000 1.301030 1.477121 1.698970 1.903090 2.113943 2.322219
[9] 2.531479 2.740363
log(vec_num1, base = 10) [1] 1.000000 1.000000 1.301030 1.477121 1.698970 1.903090 2.113943 2.322219
[9] 2.531479 2.740363
sqrt(vec_num1) [1] 3.162278 3.162278 4.472136 5.477226 7.071068 8.944272 11.401754
[8] 14.491377 18.439089 23.452079
vec_num1^2 [1] 100 100 400 900 2500 6400 16900 44100 115600 302500
exp(vec_num1) [1] 2.202647e+04 2.202647e+04 4.851652e+08 1.068647e+13 5.184706e+21
[6] 5.540622e+34 2.872650e+56 1.591627e+91 4.572186e+147 7.277212e+238
R has many useful vector functions; I’ll introduce a couple of them here. To illustrate what they do, we’ll use
vec_num1 <- c(1, 2, 3, 4, 3, 2, 1)
cumsum(x) shows the cumulative sum of a vector. Its first element is the first element of x; its second element is the sum of its first element and the second element of x; the third equals its second element (or the sum of the first two elements in x) plus the third element of x, and so on. If one of the elements is a missing value (NA), that element and all later elements of the cumulative sum will be NA.
cumsum(vec_num1)[1] 1 3 6 10 13 15 16
As you can see, the second element is equal to 1 + 2, the sum of the first two elements in x. The third element, 6, is equal to the second element in the cumulative sum (3) plus the third element in vec_num1 (3), and so on.
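The effect of a missing value on the cumulative sum can be seen in a short sketch with an illustrative vector cum_demo:

```r
cum_demo <- c(1, 2, NA, 4)

# From the NA onwards, every cumulative sum is NA
cum_result <- cumsum(cum_demo)
cum_result
# [1]  1  3 NA NA
```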
cumprod(x) is a similar function but calculates the cumulative product.
cumprod(vec_num1)[1] 1 2 6 24 72 144 144
cummax(x) and cummin(x) produce a vector with cumulative maximum and minimum values. Both start with the first observation in x and use this as their first element. If the second element in x is larger than the first, the second element in the output vector of cummax() will equal that value; else it will equal its first value. The function then evaluates the third element in x. If that element is larger than the second element in the output vector of cummax(), the third element in the cummax() vector will be that third element of x; else the third element in the cummax() vector equals its second element. To see how this works:
cummax(vec_num1)[1] 1 2 3 4 4 4 4
As you can see, the first element is 1. As the second element in vec_num1 is 2, this is a new maximum and the cummax() vector’s second element is 2. The same holds for the third element in vec_num1: it is larger than the second element in the cummax() vector, so this is a new maximum. The third element in the cummax() vector shows this. After the fourth element, all elements in vec_num1 are smaller than the maximum value reached so far. In the cummax() vector, the maximum is now stable.
cummin() is similar, but sets the minimum:
cummin(vec_num1)[1] 1 1 1 1 1 1 1
The {purrr} package includes a function reduce() which is very useful with vectors. This function reduces the elements of a vector to a single value using a 2-argument function. The last values of cumsum() and cumprod() equal the sum and product of all elements in the vector, but these functions also show all intermediate cumulative results. You can calculate only that final value using purrr::reduce(.x, .f, ..., .init, .dir = c("forward", "backward")). The first argument, .x, is a list or an atomic vector. The second argument, .f, is the function that is applied across the elements. This function needs two arguments: the first is the accumulated value from the previous step; the second is the next element of the vector. The arguments .init and .dir = c("forward", "backward") set the initial value and the direction of the reduction, with “forward” being the default. If .init is not supplied, the reduction starts from the first element of .x. To calculate the total sum using this function:
purrr::reduce(vec_num1, .f = sum)[1] 16
or even simpler:
purrr::reduce(vec_num1, `+`)[1] 16
and the total product:
purrr::reduce(vec_num1, `*`)[1] 144
Note that this function is not limited to + or *, but can be used with, e.g. /
purrr::reduce(vec_num1, `/`)[1] 0.006944444
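To see how such a reduction proceeds step by step without extra packages, the sketch below uses base R's Reduce(), which follows the same idea as purrr::reduce(). It is shown here only as an illustration, not as the approach the chapter prescribes; red_demo, acc and nxt are illustrative names.

```r
red_demo <- c(1, 2, 3, 4, 3, 2, 1)

# Final value only: ((((((1 + 2) + 3) + 4) + 3) + 2) + 1)
Reduce(`+`, red_demo)
# [1] 16

# accumulate = TRUE keeps every intermediate value, like cumsum()
running <- Reduce(`+`, red_demo, accumulate = TRUE)
running
# [1]  1  3  6 10 13 15 16

# The accumulated value is the first argument, the next element the second
ratio <- Reduce(function(acc, nxt) acc / nxt, red_demo)
all.equal(ratio, 1 / 144)
# [1] TRUE
```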
If you only need the total sum of all vector elements, you can use sum(x, na.rm = FALSE):
sum(vec_num1)[1] 16
Likewise, the product of all elements in a vector can be computed using prod(x, na.rm = FALSE):
prod(vec_num1)[1] 144
Create a vector vec_1 as a sequence from 1 to 10
vec_1 <- 1:10
Use this vector to
calculate the logarithm with base 10 of each element:
log(vec_1, base = 10)
 [1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513 0.8450980
 [8] 0.9030900 0.9542425 1.0000000
log10(vec_1)
 [1] 0.0000000 0.3010300 0.4771213 0.6020600 0.6989700 0.7781513 0.8450980
 [8] 0.9030900 0.9542425 1.0000000
create vec_2 as twice vec_1 and subtract vec_1 from it:
vec_2 <- vec_1 * 2
vec_2 - vec_1
 [1] 1 2 3 4 5 6 7 8 9 10
calculate the cumulative sum and the cumulative product of vec_1. Store the results in vec_1s and vec_1p:
vec_1s <- cumsum(vec_1)
vec_1p <- cumprod(vec_1)
show the last element of these results, i.e. the total sum and total product of vec_1. To do so, assume that you don’t know the number of columns in this vector.
vec_1s[length(vec_1)]
[1] 55
vec_1p[length(vec_1)]
[1] 3628800
Calculate the total sum and total product of vec_1 in two other ways
The total sum of vec_1:
# Option 1
sum(vec_1)[1] 55
# Option 2:
purrr::reduce(vec_1, sum)[1] 55
The total product of vec_1:
# Option 1
prod(vec_1)[1] 3628800
# Option 2:
purrr::reduce(vec_1, `*`)[1] 3628800
The “r”-variants of the distribution functions such as rnorm() were covered in a previous section. Here, we will (re-)introduce the other three variants. Applied to the normal distribution, these are pnorm(), dnorm() and qnorm(). We’ll use the vector vec_stat to illustrate these functions:
vec_stat <- c(-1.959964, -1.64448, -1.281552, 0, 1.281552, 1.64448, 1.959964)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) shows the probability that a value is smaller than or equal to q, by default for a standard normal distribution. Changing the default lower.tail = TRUE to FALSE shows the probability that a value is larger than q. For vec_stat, these probabilities are equal to
pnorm(q = vec_stat, lower.tail = TRUE)
[1] 0.02500000 0.05003855 0.09999992 0.50000000 0.90000008 0.94996145 0.97500000
pnorm(q = vec_stat, lower.tail = FALSE)
[1] 0.97500000 0.94996145 0.90000008 0.50000000 0.09999992 0.05003855 0.02500000
dnorm(x, mean = 0, sd = 1, log = FALSE) shows the density at x, i.e. the height of the normal density curve at that value. For a continuous distribution this is not a probability:
dnorm(x = vec_stat)
[1] 0.05844507 0.10319904 0.17549823 0.39894228 0.17549823 0.10319904 0.05844507
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) is the inverse of pnorm(): it shows the value for which the probability of finding a value smaller than or equal to it is equal to p. The default lower.tail = TRUE has the same interpretation as in pnorm(). Applied to c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975):
qnorm(p = c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), lower.tail = TRUE)
[1] -1.959964 -1.644854 -1.281552 1.281552 1.644854 1.959964
qnorm(p = c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), lower.tail = FALSE)
[1] 1.959964 1.644854 1.281552 -1.281552 -1.644854 -1.959964
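The relation between pnorm(), its lower.tail argument and qnorm() can be checked with a short sketch. The vector x below is an arbitrary illustration, not taken from the text.

```r
# Arbitrary illustrative values
x <- c(-1.96, 0, 1.28)

p_lower <- pnorm(x)                      # P(X <= x)
p_upper <- pnorm(x, lower.tail = FALSE)  # P(X > x)

# The two tails always add up to 1
tail_sum <- p_lower + p_upper
tail_sum

# qnorm() undoes pnorm(), for either tail
back_lower <- qnorm(p_lower)
back_upper <- qnorm(p_upper, lower.tail = FALSE)
back_lower
back_upper
```

Both back_lower and back_upper return the original values in x, up to floating point precision.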
For other distributions, such as Student’s t, Chi-square, uniform or F, you can apply similar functions (e.g. pt(), qchisq(), punif() or qf()).
In addition to these probability functions, there are many functions that summarize a vector. These include functions for central tendency and location (mean, median, …), for the level of dispersion and for skewness. To illustrate these functions, we’ll use
vec_norm <- rnorm(100, 5, 10)
Here we will focus on functions that you can use to summarise the data: mean(x, trim = 0, na.rm = FALSE) calculates the mean of a vector. The second argument, trim = 0, can be used to remove observations at each end before computing the mean. For instance, trim = 0.10 would remove the smallest and largest 10% of all values and calculate the mean with the middle 80%. By default, this is 0. na.rm = FALSE tells R that it shouldn’t remove missing values. If the vector includes missing values and the default FALSE is kept, the result of this function will be NA.
mean(x = vec_norm, na.rm = TRUE)[1] 4.414747
mean(x = vec_norm, trim = 0.10, na.rm = TRUE)[1] 4.443764
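As a minimal sketch of what trim and na.rm do, the example below recomputes a trimmed mean by hand on a small fixed vector. The vector x and the object names are illustrative assumptions; vec_norm is random, so it is not used here.

```r
# Small fixed vector with one extreme value (illustrative)
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 100)
n <- length(x)

# trim = 0.10 drops floor(n * 0.10) observations at each end
k <- floor(n * 0.10)
x_sorted <- sort(x)
trimmed_by_hand <- mean(x_sorted[(k + 1):(n - k)])
trimmed_builtin <- mean(x, trim = 0.10)
c(plain = mean(x), by_hand = trimmed_by_hand, builtin = trimmed_builtin)

# A single missing value makes the mean NA unless na.rm = TRUE
x_na <- c(x, NA)
mean_with_na <- mean(x_na)
mean_na_removed <- mean(x_na, na.rm = TRUE)
```

The plain mean is pulled up by the value 100, while the trimmed mean ignores it together with the smallest value.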
The median(x, na.rm = FALSE) function calculates the median. Here again, you need to specify how to handle missing observations.
median(vec_norm, na.rm = TRUE)[1] 3.787437
Note that the median is a special case of quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE). This function allows you to compute the quantiles of a distribution. By default, the function calculates the minimum, the 25th percentile, the median (50th percentile), the 75th percentile and the maximum. You can see this in probs = seq(0, 1, 0.25): recall that seq(0, 1, 0.25) produces the vector (0, 0.25, 0.50, 0.75, 1). You can change this default by including your own values, e.g. c(0.10, 0.25, 0.50, 0.75, 0.90). This option would show the 1st, 2nd and 3rd quartile (25th, 50th and 75th percentile) in addition to the 10th and 90th percentile. To see all deciles, you can use seq(0.10, 0.90, 0.10) as the value for probs. The last argument, names = TRUE, tells R to add names to the values (e.g. 10%, 25%, 50%, …). If you set this argument to FALSE, these names are dropped. If you save the results in a new vector, you can subset them using the subsetting methods for both named and unnamed vectors. To see the 10th and 90th percentile as well as the 1st, 2nd and 3rd quartile of vec_stat:
vec_quan <- quantile(x = vec_stat, probs = c(0.10, 0.25, 0.50, 0.75, 0.90), na.rm = TRUE, names = TRUE)
vec_quan 10% 25% 50% 75% 90%
-1.770674 -1.463016 0.000000 1.463016 1.770674
You can now subset vec_quan:
vec_quan[1] 10%
-1.770674
vec_quan["75%"] 75%
1.463016
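A short sketch of two points made about quantile(): the median is the 50% quantile, and names = FALSE drops the percentage labels. It re-creates vec_stat as defined earlier; the other object names are illustrative.

```r
vec_stat <- c(-1.959964, -1.64448, -1.281552, 0, 1.281552, 1.64448, 1.959964)

# The 50% quantile without a name, compared with median()
q_med <- quantile(vec_stat, probs = 0.50, names = FALSE)
q_med
median(vec_stat)

# All deciles: a named vector with one element per probability
deciles <- quantile(vec_stat, probs = seq(0.10, 0.90, 0.10))
names(deciles)

# Subsetting by name or by position gives the same value
by_name <- deciles[["50%"]]
by_position <- deciles[[5]]
```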
To see the minimum and maximum values, you can use min() and max(). In addition to the vector, these functions have the argument na.rm, which you can change from its default FALSE to TRUE:
min(vec_norm, na.rm = TRUE)[1] -15.86817
max(vec_norm, na.rm = TRUE)[1] 27.46498
The summary() function shows the mean and median as well as the minimum, maximum and the 1st and 3rd quartile. This function returns a table. If you save the results, you can subset this table using the traditional subsetting rules for named and unnamed vectors.
tab_sum <- summary(vec_norm)
tab_sum Min. 1st Qu. Median Mean 3rd Qu. Max.
-15.868 -3.233 3.787 4.415 12.944 27.465
If you use a name to subset, note that the names of some summary statistics include a ‘.’ at the end:
tab_sum["3rd Qu."] 3rd Qu.
12.94374
Often used measures of dispersion include the range (the minimum and maximum values), the interquartile range (the difference between the 3rd and 1st quartile), the variance and the standard deviation. To use the range() function, you need to supply it with a vector. The other argument, na.rm, is FALSE by default. The function shows the minimum and maximum value. Note that these statistics are also included in e.g. summary(), min() and max() and that you can also select them using quantile().
range(vec_norm, na.rm = TRUE)[1] -15.86817 27.46498
To calculate the interquartile range or IQR, you can use IQR(). The most important arguments of this function are the vector and na.rm:
IQR(x = vec_norm, na.rm = TRUE)[1] 16.177
To compute the variance you can use var(x, na.rm = FALSE). You can calculate the standard deviation either as the square root of the variance or using sd(x, na.rm = FALSE). In both functions, x is the vector whose variance or standard deviation you need to compute:
var(x = vec_norm, na.rm = TRUE)[1] 104.2964
sqrt(var(x = vec_norm, na.rm = TRUE))[1] 10.21256
sd(x = vec_norm, na.rm = TRUE)[1] 10.21256
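To see what var() and sd() compute, the sketch below reproduces the sample variance by hand on a small fixed vector. The vector x is an illustrative assumption; note the division by n - 1 rather than n.

```r
# Small fixed vector (illustrative)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
sd_by_hand <- sqrt(var_by_hand)

c(by_hand = var_by_hand, builtin = var(x))
c(by_hand = sd_by_hand, builtin = sd(x))
```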
To calculate moments of an order higher than 2, you can use the {moments} package. This package includes functions such as skewness() and kurtosis(), which are based on the third and fourth central moments of the distribution. For moments of any order, you can use moment(x, order = 1, central = FALSE, absolute = FALSE, na.rm = FALSE). The order = argument allows you to set the order (e.g. 2 for the second moment, 3 for the third, …). To compute moments around the mean (as you would do to calculate the variance), set central = TRUE. To use this package, you have to install it first.
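If the {moments} package is not installed, central moments can be sketched in base R. The helper central_moment() below is a hypothetical function written for this illustration, and the skewness and kurtosis are computed as standardized third and fourth central moments (dividing by n, not n - 1). Whether a given package uses exactly this definition is an assumption you should check in its documentation.

```r
# Small fixed vector (illustrative)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Hypothetical helper: central moment of a given order (divides by n)
central_moment <- function(x, order) mean((x - mean(x))^order)

m2 <- central_moment(x, 2)

# Standardized third and fourth central moments
skew_by_hand <- central_moment(x, 3) / m2^(3 / 2)
kurt_by_hand <- central_moment(x, 4) / m2^2

c(skewness = skew_by_hand, kurtosis = kurt_by_hand)
```

A positive skewness, as here, indicates a longer right tail.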
Create a vector vec_rn with 100 draws from a normal distribution with mean 5 and standard deviation 5 and a vector vec_rt with 100 draws from Student’s t-distribution with 10 degrees of freedom
vec_rn <- rnorm(100, 5, 5)
vec_rt <- rt(100, 10)
Determine the probability that you find values larger than each of the elements in c(1.65, 1.75, 2.10) if these values follow a t-distribution with 5 degrees of freedom.
pt(c(1.65, 1.75, 2.10), df = 5, lower.tail = FALSE)[1] 0.07992788 0.07026118 0.04487662
For a Chi-square distribution with 10 degrees of freedom, determine the values for which holds that the probabilities that you find a value smaller than or equal to that value are equal to c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975)
qchisq(c(0.025, 0.05, 0.10, 0.90, 0.95, 0.975), df = 10)[1] 3.246973 3.940299 4.865182 15.987179 18.307038 20.483177
Using vec_rn determine:
mean(vec_rn, na.rm = TRUE)[1] 5.310853
mean(vec_rn, trim = 0.10, na.rm = TRUE)[1] 5.312695
median(vec_rn, na.rm = TRUE)[1] 5.104114
quantile(vec_rn, na.rm = TRUE) 0% 25% 50% 75% 100%
-5.807568 1.486231 5.104114 8.714391 15.770025
min(vec_rn)[1] -5.807568
max(vec_rn)[1] 15.77002
range(vec_rn, na.rm = TRUE)[1] -5.807568 15.770025
IQR(vec_rn, na.rm = TRUE)[1] 7.228161
sd(vec_rn, na.rm = TRUE)[1] 5.049125
var(vec_rn, na.rm = TRUE)[1] 25.49367
Rounding numeric values uses round(), floor(), ceiling(), trunc() or signif(). Applied to a numeric vector, these functions output a vector with rounded data. To illustrate, let’s first take the natural logarithm of vec_num1 and use this vector to show how these functions work.
vec_num3 <- log(vec_num1)
round(x, digits = 0): rounds x to the number of decimal places set in digits. With digits = 2:
round(vec_num3, digits = 2)
[1] 0.00 0.69 1.10 1.39 1.10 0.69 0.00
floor(x): rounds to the largest integer not greater than the value in x:
floor(vec_num3)
[1] 0 0 1 1 1 0 0
ceiling(x): rounds to the smallest integer not less than the value in x:
ceiling(vec_num3)
[1] 0 1 2 2 2 1 0
trunc(x): removes all decimal places:
trunc(vec_num3)
[1] 0 0 1 1 1 0 0
signif(x, digits = 6): rounds values in x to the specified number of significant digits. Applied to c(123456, 654321, 147258, 852147):
signif(x = c(123456, 654321, 147258, 852147), digits = 4)
[1] 123500 654300 147300 852100
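With positive values only, floor() and trunc() give the same result. The difference shows up with negative numbers, as the sketch below illustrates on an arbitrary vector. It also shows that round() sends values exactly halfway between two integers to the even one.

```r
# Arbitrary vector with negative and positive values
x <- c(-1.7, -0.2, 0.2, 1.7)

x_floor <- floor(x)      # towards minus infinity: -2 -1  0  1
x_ceiling <- ceiling(x)  # towards plus infinity:  -1  0  1  2
x_trunc <- trunc(x)      # towards zero:           -1  0  0  1

rbind(x, x_floor, x_ceiling, x_trunc)

# Halves are rounded to the nearest even integer
half_rounded <- round(c(0.5, 1.5, 2.5))  # 0 2 2
half_rounded
```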
Boolean operators work element wise. For instance, to check if the values in vec_num1 are larger than 50:
vec_num1 > 50[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Recall that we used this observation to subset a vector. There are two other useful functions to apply to vectors: any() and all(). The first checks if at least one of the values is TRUE. The other checks if all values are TRUE. For instance, to check if any of the values in vec_num1 are larger than 450:
any(vec_num1 > 450, na.rm = TRUE)[1] FALSE
You can use all() to check if a condition holds for all elements in a vector. For instance, to see if all elements are larger than 1, you can use
all(vec_num1 > 1, na.rm = TRUE)[1] FALSE
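The na.rm argument matters because a missing value can leave the answer undetermined. A minimal sketch with an arbitrary vector that contains an NA:

```r
# Arbitrary vector with a missing value
x <- c(3, NA, 8)

any_gt5 <- any(x > 5)                 # TRUE: one TRUE is enough
all_gt5 <- all(x > 5)                 # FALSE: one FALSE is enough
all_gt2 <- all(x > 2)                 # NA: the missing value could be <= 2
all_gt2_rm <- all(x > 2, na.rm = TRUE)  # TRUE once the NA is dropped

# Counting and locating the elements that meet the condition
n_gt5 <- sum(x > 5, na.rm = TRUE)
pos_gt5 <- which(x > 5)
```

sum() counts the TRUE values because TRUE is treated as 1 and FALSE as 0; which() returns their positions and skips NA.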
In Chapter 3, we introduced character functions. Here, we will see how they can be used with character vectors. Note that some {stringr} functions, such as str_extract_all(), return a list. We will meet lists in this chapter and see how you can use them in your analysis. Here, we will only use their properties if needed. Most base R functions output a vector. Sometimes, this makes them easier to use.
We already looked at paste() and paste0(). You can use these functions to generate a series of numbers as characters:
paste(1:5)[1] "1" "2" "3" "4" "5"
If you apply these functions to a vector with several character values, you can change the default collapse = NULL to combine these values into one string. To see what these functions do, let’s use:
vec_char1 <- c("dog", "cat", "fish")
Let’s first use paste()
paste(vec_char1, collapse = " __ ")[1] "dog __ cat __ fish"
As you can see, the three elements of the character vector are now one string, separated by “ __ ”. If you use paste0(), the result is similar:
paste0(vec_char1, collapse = "**")[1] "dog**cat**fish"
Both functions also allow you to create variable names such as “var_1”, “var_2”, …
paste("var", 1:5, sep = "_")[1] "var_1" "var_2" "var_3" "var_4" "var_5"
paste0("var", 1:5)[1] "var1" "var2" "var3" "var4" "var5"
Note that here you leave the collapse default, as you want to keep the various character values as separate values in a character vector. For instance, because you will use them as names in a dataset.
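The difference between sep and collapse can be summarised in a short sketch: sep joins the arguments element by element, collapse then joins the resulting elements into one string. The object names are illustrative.

```r
# sep only: one string per element
labels <- paste("var", 1:3, sep = "_")
labels
length(labels)

# sep and collapse: a single string
one_string <- paste("var", 1:3, sep = "_", collapse = ", ")
one_string
length(one_string)

# Shorter vectors are recycled to the length of the longest one
recycled <- paste0(c("a", "b"), 1:4)
recycled
```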
When we introduced string variables, we introduced regular expressions. Recall that you can include these regular expressions in {stringr}’s functions as well as in e.g. grep() or grepl(). These are especially useful if you have a character vector. Let’s define a character vector first. For instance, a list of product codes could look like
vec_char3 <- c("25-78T", "25-98S", "45-97Q", "45-74Q", "45-72T", "55-10T", "55-48T", "55-69Q", "55+178T",
"173+8W", "235+9W", "125+1W", "274+2Q", "274+5Q", "751+9Q", "274+1W", "274+4W", "274+5Q", "751+6Q")
Suppose that you need to extract all product codes that start with 55 and end with T. As you can see in the character vector, there are three matches: “55-10T”, “55-48T” and “55+178T”. You don’t want to extract e.g. “55-69Q” or “25-78T”. This is where regular expressions are very useful. Recall that regular expressions allow you to identify patterns using e.g. “[A-Z][a-z]+”, “\\d” or “[0-9]{4}”. The first searches for a capital letter A to Z followed by one or more lowercase letters a to z. The second searches for a digit and the third for exactly 4 consecutive digits. In the second, we use the backslash “\”. Recall that this is the escape character for regular expressions. Using “d” in a regular expression would search for the occurrence of the letter “d”. To tell R it has to look for digits, we need to “escape” the usual meaning of “d” with “\”. As “\” is also the escape character in R strings, a single backslash would not reach the regular expression: we need to tell R to read the backslash as a literal backslash. This is where the second backslash enters. In other words, the first backslash in “\\d” tells R to read the second backslash literally. That second backslash is then passed on to the regular expression, where it acts as the escape character and changes the interpretation of “d”: not the literal letter “d” but any digit. The same holds if you want to match a literal dot “.”: you need to use “\\.”. Doing so, R will look for an actual “.” instead of matching any character.
If you want to avoid backslashes for special characters such as ., *, {, }, +, $, |, ?, (, ), you can include them between square brackets []. For instance [.] or [?] will look for a literal “.” or “?”. The caret ^ is an exception: at the start of a set it negates the set.
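The effect of escaping a dot can be checked with a small sketch. The vector codes is an arbitrary illustration. The argument fixed = TRUE of grepl() is a further base R option that treats the whole pattern as literal text.

```r
# Arbitrary illustrative strings
codes <- c("abc", "a.c", "a+c")

dot_any <- grepl("a.c", codes)           # "." matches any character
dot_escaped <- grepl("a\\.c", codes)     # "\\." matches a literal dot
dot_in_set <- grepl("a[.]c", codes)      # "[.]" also matches a literal dot
dot_fixed <- grepl("a.c", codes, fixed = TRUE)  # no regular expression at all

rbind(dot_any, dot_escaped, dot_in_set, dot_fixed)

# The R string "\\d" holds two characters: one backslash and the letter d
n_chars <- nchar("\\d")
n_chars
```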
To search for a specific word or part of a word, you can use that word or its part as the pattern. Suppose that you have a character vector
vec_char1 <- c("pineapple", "strawberry", "blackberry", "apple", "banana", "grapefruit", "melon", "cranberry", "kiwi", "lemon")
To extract all elements including “berry”, you can use e.g. grep(). Recall that this function’s first argument is the pattern and the second the character vector in which grep() looks for pattern matches. You can change ignore.case = FALSE to TRUE to ignore the difference between uppercase and lowercase, and change the default value = FALSE to TRUE to show the matching values as opposed to their positions. To show the pattern match, you’ll also see the output from {stringr}’s str_view() function. This function shows the matches in the character vector. With match = NA, all elements are shown; with match = TRUE, the output is limited to the matched strings. Adding html = TRUE shows an html widget in the Viewer tab of the files pane.
grep(pattern = "berry", x = vec_char1, value = TRUE)[1] "strawberry" "blackberry" "cranberry"
stringr::str_view(vec_char1, "berry", match = NA ) [1] │ pineapple
[2] │ straw<berry>
[3] │ black<berry>
[4] │ apple
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cran<berry>
[9] │ kiwi
[10] │ lemon
To see the position of these values:
grep(pattern = "berry", x = vec_char1, value = FALSE)[1] 2 3 8
stringr::str_view(vec_char1, "berry", match = NA ) [1] │ pineapple
[2] │ straw<berry>
[3] │ black<berry>
[4] │ apple
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cran<berry>
[9] │ kiwi
[10] │ lemon
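Besides positions and values, you often need a logical vector, e.g. for subsetting. The sketch below compares grep(), grepl() and the ignore.case argument on an arbitrary vector; fruit is an illustrative name, not an object from the text.

```r
# Arbitrary vector with mixed case
fruit <- c("Strawberry", "blackberry", "apple", "BERRY jam")

pos_exact <- grep("berry", fruit)                        # positions: 1 2
pos_nocase <- grep("berry", fruit, ignore.case = TRUE)   # positions: 1 2 4
is_match <- grepl("berry", fruit)                        # one TRUE/FALSE per element

# A logical vector can be used directly to subset
matched <- fruit[is_match]
matched
```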
Including the pattern “an” shows all elements that include the letters “an”:
grep(pattern = "an", x = vec_char1, value = TRUE)[1] "banana" "cranberry"
stringr::str_view(vec_char1, "an", match = NA) [1] │ pineapple
[2] │ strawberry
[3] │ blackberry
[4] │ apple
[5] │ b<an><an>a
[6] │ grapefruit
[7] │ melon
[8] │ cr<an>berry
[9] │ kiwi
[10] │ lemon
Using “|” you can combine alternative patterns. For instance, “an|be” matches all elements in the vector that include “an” or “be”.
grep(pattern = "an|be", x = vec_char1, value = TRUE)[1] "strawberry" "blackberry" "banana" "cranberry"
stringr::str_view(vec_char1, "an|be", match = NA) [1] │ pineapple
[2] │ straw<be>rry
[3] │ black<be>rry
[4] │ apple
[5] │ b<an><an>a
[6] │ grapefruit
[7] │ melon
[8] │ cr<an><be>rry
[9] │ kiwi
[10] │ lemon
In these examples, the pattern is fixed, e.g. “berry”. However, often the patterns are less clear. Regular expressions allow you to build more complex patterns.
First you can combine one or more sets of characters (letters, numbers, symbols) with quantifiers that control the number of occurrences and anchors that determine where a pattern occurs. Let’s start with the first. A character set is included between []. For instance, [rst] matches lowercase “r”, “s”, or “t”; [aeiou] matches all lowercase vowels “a”, “e”, “i”, “o” or “u”, [IJK] matches uppercase “I”, “J”, or “K” and [aBcD] matches lowercase “a” or “c” or uppercase “B” or “D”. In a similar way you can match numbers. For instance [0123] matches “0”, “1”, “2” or “3”. You can include letters and numbers in your set: [a1B2c] matches “a”, “1”, “B”, “2” or “c”. Note that you can combine character sets in one regular expression. For instance [aeiou]b[aeiou] searches for a pattern: a vowel, the letter b and another vowel. For instance, to search for the pattern: any letter from “m”, “n”, “o”, “p”, “q” or “r” followed by an “a” followed by any letter from “m”, “n”, “o”, “p”, “q” or “r”:
grep("[mnopqr]a[mnopqr]", vec_char1, value = TRUE)[1] "banana" "grapefruit" "cranberry"
stringr::str_view(vec_char1, "[mnoprq]a[mnopqr]", match = NA) [1] │ pineapple
[2] │ strawberry
[3] │ blackberry
[4] │ apple
[5] │ ba<nan>a
[6] │ g<rap>efruit
[7] │ melon
[8] │ c<ran>berry
[9] │ kiwi
[10] │ lemon
Most special characters lose their special meaning inside a set, so they usually don’t need the escape character there. For instance [a-z.] matches “a” to “z” as well as “.”, [$] matches “$” and [{}] matches “{” or “}”.
If you use - within your character set, you can define a range: [a-z] matches all lowercase letters starting from a and running across the alphabet to z. If you change the “a” or “z” you can restrict the range, e.g. [k-n] matches “k”, “l”, “m”, or “n”. Using uppercase allows you to define a range of uppercase letters: [B-E] matches “B”, “C”, “D” or “E”. Adding both, e.g. [a-zA-Z] or [A-Za-z], matches any character “a” to “z” or “A” to “Z”. Using numbers, [0-9] matches all digits from 0 to 9 while [1-3] matches all digits from 1 to 3. To see how these ranges work, let’s look for the pattern: any letter from “a” to “m” followed by an “e” followed by any letter from “n” to “z” in vec_char1:
grep("[a-m]e[n-z]", vec_char1, value = TRUE)[1] "strawberry" "blackberry" "cranberry"
stringr::str_view(vec_char1, "[a-m]e[n-z]", match = NA) [1] │ pineapple
[2] │ straw<ber>ry
[3] │ black<ber>ry
[4] │ apple
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cran<ber>ry
[9] │ kiwi
[10] │ lemon
Including the caret sign “^” at the start of a character set works as a negation. For instance [^a-k] matches any character except the lowercase letters “a” to “k”, [^qrt] matches any character except “q”, “r” or “t”. Used with digits, [^3-9] matches any character except “3” to “9”. To see how this works, let’s use the caret sign in the previous regular expression: “[^a-m]e[^n-z]” matches a character not from “a” to “m”, followed by an “e”, followed by a character not from “n” to “z”:
grep("[^a-m]e[^n-z]", vec_char1, value = TRUE)[1] "pineapple" "grapefruit"
stringr::str_view(vec_char1, "[^a-m]e[^n-z]", match = NA) [1] │ pi<nea>pple
[2] │ strawberry
[3] │ blackberry
[4] │ apple
[5] │ banana
[6] │ gra<pef>ruit
[7] │ melon
[8] │ cranberry
[9] │ kiwi
[10] │ lemon
In addition to these character sets, there are metacharacters and shortcuts that have their own meaning. With respect to the metacharacters, you can use “.” to refer to any single character. In other words “a..b” matches all patterns that start with a, end with b and have two characters between them. For instance if you want to find matches for “a..e” in vec_char1, R will look at all occurrences of “a” followed by 2 other characters and ending with “e”:
grep("a..e", vec_char1, value = TRUE)[1] "strawberry" "cranberry"
stringr::str_view(vec_char1, "a..e", match = NA) [1] │ pineapple
[2] │ str<awbe>rry
[3] │ blackberry
[4] │ apple
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cr<anbe>rry
[9] │ kiwi
[10] │ lemon
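With respect to the shortcuts, they include:
\d: any digit; \D: any character that is not a digit
\w: any word character (letter, digit or underscore); \W: any character that is not a word character
\s: any whitespace (e.g. space, tab, new line); \S: any character that is not whitespace
Note that in an R string you need to escape the backslash itself (e.g. "\\d"). Using these shortcuts, you can replace [0-9] with \d, search for whitespace using \s, … . As an example: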
grep("\\d", c("125", "abc! ", " "), value = TRUE)[1] "125"
grep("[0-9]", c("125", "abc! ", " "), value = TRUE)[1] "125"
stringr::str_view(c("125", "abc! ", " "), "\\d", match = NA)[1] │ <1><2><5>
[2] │ abc!
[3] │
grep("\\D", c("125", "abc! ", " "), value = TRUE)[1] "abc! " " "
stringr::str_view(c("125", "abc! ", " "), "\\D", match = NA)[1] │ 125
[2] │ <a><b><c><!>< >
[3] │ < >
grep("\\w", c("125", "abc! ", " "), value = TRUE)[1] "125" "abc! "
stringr::str_view(c("125", "abc! ", " "), "\\w", match = NA)[1] │ <1><2><5>
[2] │ <a><b><c>!
[3] │
grep("\\W", c("125", "abc! "), value = TRUE)[1] "abc! "
stringr::str_view(c("125", "abc! ", " "), "\\W", match = NA)[1] │ 125
[2] │ abc<!>< >
[3] │ < >
grep("\\s", c("125", "abc! ", " "), value = TRUE)[1] "abc! " " "
stringr::str_view(c("125", "abc! ", " "), "\\s", match = NA)[1] │ 125
[2] │ abc!< >
[3] │ < >
grep("\\S", c("125", "abc! ", " "), value = TRUE)[1] "125" "abc! "
stringr::str_view(c("125", "abc! ", " "), "\\S", match = NA)[1] │ <1><2><5>
[2] │ <a><b><c><!>
[3] │
Anchors determine the location of a pattern. Using the caret ^, the pattern needs to be located at the start of the string. In other words ^r will match strings starting with an “r”. Note here the difference in result if ^ is used within [ ] and at the start of a pattern. Within square brackets, it excludes the letters or numbers within the square brackets. At the start of a pattern, it determines the position of a character. Ending a pattern with a $ means that the pattern should be at the end of a string. In other words y$ matches a y at the end of the string. Using \b, you locate a pattern at a word boundary, i.e. the start or end of a word (e.g. next to a space, dash, comma, semicolon, dot, …) while \B matches at any position that is not a word boundary.
For example:
grep("^b", vec_char1, value = TRUE)[1] "blackberry" "banana"
stringr::str_view(vec_char1, "^b", match = NA) [1] │ pineapple
[2] │ strawberry
[3] │ <b>lackberry
[4] │ apple
[5] │ <b>anana
[6] │ grapefruit
[7] │ melon
[8] │ cranberry
[9] │ kiwi
[10] │ lemon
grep("y$", vec_char1, value = TRUE)[1] "strawberry" "blackberry" "cranberry"
stringr::str_view(vec_char1, "y$", match = NA) [1] │ pineapple
[2] │ strawberr<y>
[3] │ blackberr<y>
[4] │ apple
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cranberr<y>
[9] │ kiwi
[10] │ lemon
grep("e\\b", c("average costs", "total sales", "total revenues"), value = TRUE)[1] "average costs"
stringr::str_view(c("average costs", "total sales", "total revenues"), "e\\b", match = NA)[1] │ averag<e> costs
[2] │ total sales
[3] │ total revenues
grep("s\\B", c("average costs", "total sales"), value = TRUE)[1] "average costs" "total sales"
stringr::str_view(c("average cost", "total sales", "total revenues"), "s\\B", match = NA)[1] │ average co<s>t
[2] │ total <s>ales
[3] │ total revenues
Using word boundaries, you can match occurrences of e.g. individual numbers not included in another one. For instance, to identify the number “2” as a number not included in e.g. “125” or “210” you can use
grep("\\b2\\b", c("125", "2", "210"), value = TRUE)[1] "2"
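Combining ^ and $ in one pattern forces the whole string to match, not just a part of it. A minimal sketch on an arbitrary vector, checking whether elements consist of digits only:

```r
# Arbitrary illustrative strings
x <- c("125", "12a", "a125", "")

# Without anchors: TRUE as soon as a digit occurs somewhere
has_digit <- grepl("\\d+", x)

# With both anchors: TRUE only if the entire string consists of digits
only_digits <- grepl("^\\d+$", x)

data.frame(x, has_digit, only_digits)
```

This kind of check is useful before converting a character vector with as.numeric().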
Quantifiers control the number of occurrences of a pattern. Using + at the end of a pattern means that this pattern occurs one or more times. Such a run of repetitions can occur anywhere in the string, unless the rest of the regular expression restricts its position. For instance “[a-z]+” means that a letter from “a” to “z” can occur once but also multiple times in a row. * is used when a pattern doesn’t have to occur or could occur with one or multiple repetitions. With ? you need at most one occurrence. In other words, the pattern before ? is optional. Using {x} fixes the number of repetitions to x while {x,y} sets the number of repetitions between x and y.
Here are a couple of examples:
grep("p+", vec_char1, value = TRUE)[1] "pineapple" "apple" "grapefruit"
stringr::str_view(vec_char1, "p+", match = NA) [1] │ <p>inea<pp>le
[2] │ strawberry
[3] │ blackberry
[4] │ a<pp>le
[5] │ banana
[6] │ gra<p>efruit
[7] │ melon
[8] │ cranberry
[9] │ kiwi
[10] │ lemon
grep("abcQ?abc", c("abcQabc", "abcabc", "abc_abc"), value = TRUE)[1] "abcQabc" "abcabc"
stringr::str_view(c("abcQabc", "abcabc", "abc_abc"), "abcQ?abc", match = NA)[1] │ <abcQabc>
[2] │ <abcabc>
[3] │ abc_abc
grep("p{2}", vec_char1, value = TRUE)[1] "pineapple" "apple"
stringr::str_view(vec_char1, "p{2}", match = NA) [1] │ pinea<pp>le
[2] │ strawberry
[3] │ blackberry
[4] │ a<pp>le
[5] │ banana
[6] │ grapefruit
[7] │ melon
[8] │ cranberry
[9] │ kiwi
[10] │ lemon
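The four quantifiers can be compared side by side on one small vector. The strings below are arbitrary; the anchors ^ and $ are added so that each pattern has to describe the whole string.

```r
# Arbitrary illustrative strings: a, zero or more b's (or a d), then c
x <- c("ac", "abc", "abbbc", "adc")

star <- grepl("^ab*c$", x)       # zero or more b:   TRUE  TRUE  TRUE  FALSE
plus <- grepl("^ab+c$", x)       # one or more b:    FALSE TRUE  TRUE  FALSE
optional <- grepl("^ab?c$", x)   # at most one b:    TRUE  TRUE  FALSE FALSE
range <- grepl("^ab{2,3}c$", x)  # two or three b's: FALSE FALSE TRUE  FALSE

data.frame(x, star, plus, optional, range)
```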
If you have longer character vectors that include various lines, you can identify every new line or a tab using \n (new line) and \t (tab).
An expression between parentheses () forms a group. This allows you e.g. to apply a quantifier to that group. For instance, let’s use the pattern “(na)+” to find matches in vec_char1:
grep("(na)+", vec_char1, value = TRUE)[1] "banana"
stringr::str_view(vec_char1, "(na)+", match = NA) [1] │ pineapple
[2] │ strawberry
[3] │ blackberry
[4] │ apple
[5] │ ba<nana>
[6] │ grapefruit
[7] │ melon
[8] │ cranberry
[9] │ kiwi
[10] │ lemon
As you can see there is one match: “nana” in “banana”.
Note that you can store regular expressions in an object. For instance,
pat_1 <- "abc|def"
stores a regular expression you can re-use:
grep(pat_1, c("abc", "def", "ghi"), value = TRUE)[1] "abc" "def"
This allows you to generate patterns from code.
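As a minimal sketch of generating a pattern from code, the example below builds an alternation from a character vector with paste(). The vectors currencies and x are illustrative assumptions.

```r
# Illustrative list of currency codes
currencies <- c("usd", "eur", "yen")

# Join the codes with "|" to get "usd|eur|yen"
pat_cur <- paste(currencies, collapse = "|")
pat_cur

# Wrap the alternation in a group with word boundaries on both sides
pat_cur_word <- paste0("\\b(", pat_cur, ")\\b")

x <- c("usd 25", "eur 36", "yen 100", "gbp 12", "eurozone")
loose_hits <- grep(pat_cur, x, value = TRUE)       # also matches "eurozone"
word_hits <- grep(pat_cur_word, x, value = TRUE)   # whole words only
loose_hits
word_hits
```

Adding a currency now only requires a change to the vector, not to the regular expression.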
Let’s now use these regular expressions to extract characters from a character vector. Let’s first start with
vec_char2 <- c("usd 25", "eur 35", "USD 36", "EUR 88", "Usd 4700", "Eur 18723", "$25522", "€140")
Here, you can see that all strings in the character vector refer to a currency, the usd or eur, but that these references are written in multiple ways. To work with the numbers, we need to extract the currency and the amount and store each in a separate variable. Let’s stick to regular expressions (you could also use e.g. tolower() to change uppercase currencies into lowercase and gsub() to replace all occurrences of $ and € with “usd” or “eur”). We use {stringr}’s str_extract_all() to extract the currencies.
stringr::str_extract_all(vec_char2, "[A-Za-z]{3}|€|\\$")[[1]]
[1] "usd"
[[2]]
[1] "eur"
[[3]]
[1] "USD"
[[4]]
[1] "EUR"
[[5]]
[1] "Usd"
[[6]]
[1] "Eur"
[[7]]
[1] "$"
[[8]]
[1] "€"
Now use the same function to extract the numbers.
stringr::str_extract_all(vec_char2, "\\d+")[[1]]
[1] "25"
[[2]]
[1] "35"
[[3]]
[1] "36"
[[4]]
[1] "88"
[[5]]
[1] "4700"
[[6]]
[1] "18723"
[[7]]
[1] "25522"
[[8]]
[1] "140"
str_extract_all() returns a list. You can access the elements of that list using the subsetting operators for a list. For instance, to show the value for the second outcome and return a numeric value, you would use:
outcome <- stringr::str_extract_all(vec_char2, "\\d+")
as.numeric(outcome[[2]][1])[1] 35
As an alternative, you can simplify these results. To do so, you add simplify = TRUE as an argument to the str_extract_all() function
outcomes <- stringr::str_extract_all(vec_char2, "\\d+", simplify = TRUE)
outcomes |> as.numeric()[1] 25 35 36 88 4700 18723 25522 140
You can now subset these results using the usual subsetting operators.
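If you prefer to stay in base R, a comparable result can be sketched with gregexpr() and regmatches(). This is an alternative to the {stringr} call, not the method used in the text. Like str_extract_all(), regmatches() returns a list with one element per string.

```r
vec_char2 <- c("usd 25", "eur 35", "USD 36", "EUR 88", "Usd 4700", "Eur 18723", "$25522", "€140")

# gregexpr() locates all matches, regmatches() extracts them as a list
matches <- regmatches(vec_char2, gregexpr("[0-9]+", vec_char2))
class(matches)

# Each string holds exactly one number here, so unlist() gives one value per element
amounts <- as.numeric(unlist(matches))
amounts
```

When a string can hold several numbers, unlist() returns more values than there are strings, so check the lengths before combining the result with other vectors.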
Suppose that student numbers are written as “r2024-000125-B”. Here the pattern is “lowercase r; followed by the academic year; followed by -; followed by 6 digits; followed by -; and ending with any uppercase letter”. Write a regular expression that identifies these numbers in
char_stud <- c("r2024-000125-B", "r2024-005524-L", "r2024-00014-5", "r2024-1000140-C")
Note that only the first two are correct.
grep("r\\d{4}-\\d{6}-[A-Z]", char_stud, value = TRUE)[1] "r2024-000125-B" "r2024-005524-L"
How would you change this regular expression if the part in the middle could be 6 or 7 digits? If that is the case, in addition to the first two, the last number should also match.
grep("r\\d{4}-\\d{6,7}-[A-Z]", char_stud, value = TRUE)[1] "r2024-000125-B" "r2024-005524-L" "r2024-1000140-C"
Recall that dates are written as “yyyy-mm-dd”. Write a regular expression that matches the actual dates in the following character vector.
vec_dat <- c("2025-03-20", "2025-03-08", "1998-11-11", "2025-24-33", "2025-19-54")
Note that only the first 3 are correct.
grep("\\d{4}-[0-1][0-9]-[0-3][0-9]", vec_dat, value = TRUE)[1] "2025-03-20" "2025-03-08" "1998-11-11"
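The pattern above is enough for this vector, but it would still accept a month such as 13 or a day such as 39. A stricter sketch uses groups and alternation. The extra element "2025-13-39" is a hypothetical value added for illustration. Even the strict pattern cannot rule out dates such as 30 February, so it checks the format rather than the calendar.

```r
# vec_dat from the exercise plus one hypothetical invalid date
vec_dat <- c("2025-03-20", "2025-03-08", "1998-11-11", "2025-24-33", "2025-19-54", "2025-13-39")

# Pattern from the exercise: month 00-19, day 00-39
loose <- "\\d{4}-[0-1][0-9]-[0-3][0-9]"

# Stricter: month 01-12, day 01-31, anchored to the whole string
strict <- "^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$"

loose_hits <- grep(loose, vec_dat, value = TRUE)
strict_hits <- grep(strict, vec_dat, value = TRUE)
loose_hits
strict_hits
```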
Here you have some sentences. Using {stringr}’s str_count(), count the number of times the letters “the” occur in words, excluding the word “the” itself (e.g. thesis, these, they)
vec_quote <- c("The thesis was written by 2 students.",
"These students were in the same group for mathematics.",
"The first part of their work included their theory.",
"They had to apply statistics to test their hypothesis.",
"These tests were done in R.",
"To collect their data, they had to visit a theater.")
First write the regular expression to match these words:
stringr::str_view(vec_quote, "(([A-Za-z]?)+(T|t)he)[a-z]+", match = NA)[1] │ The <thesis> was written by 2 students.
[2] │ <These> students were in the same group for <mathematics>.
[3] │ The first part of <their> work included <their> <theory>.
[4] │ <They> had to apply statistics to test <their> <hypothesis>.
[5] │ <These> tests were done in R.
[6] │ To collect <their> data, <they> had to visit a <theater>.
Use str_count() to count the number of matches per sentence:
stringr::str_count(vec_quote, "(([A-Za-z]?)+(T|t)he)[a-z]+")[1] 1 2 3 3 1 3
# Check the words: the is included in thesis, these, mathematics, their, theory,
# hypothesis, they and theater. T can be both uppercase and lowercase
# (T|t)he: a group allowing for The as well as the
# part before this group: ([A-Za-z]?)+: an optional series of uppercase or
# lowercase letters: [A-Za-z] with ? (optional), repeated as a group with +
# part after (T|t)he: a series of one or more lowercase letters: [a-z]+
Here, str_count() shows the result in a vector.
Let’s return to vec_char3
vec_char3 [1] "25-78T" "25-98S" "45-97Q" "45-74Q" "45-72T" "55-10T" "55-48T"
[8] "55-69Q" "55+178T" "173+8W" "235+9W" "125+1W" "274+2Q" "274+5Q"
[15] "751+9Q" "274+1W" "274+4W" "274+5Q" "751+6Q"
and try to write a regular expression that matches all product codes that start with 55 and end with T. To define the start, we can use the caret sign: “^55”. As you can see from the product codes, “55” is followed by other characters. Sometimes there are 3, e.g. “-10”; on other occasions there are 4, e.g. “+178”. To allow for this repetition of any character, we will use “.+”: any character, with one or more repetitions. The last part includes the “T”. In a regular expression, that is “T$”. With this regular expression, we can now extract the product codes:
grep(pattern = "^55.+T$", vec_char3, value = TRUE)[1] "55-10T" "55-48T" "55+178T"
stringr::str_view(vec_char3, "^55.+T$", match = NA) [1] │ 25-78T
[2] │ 25-98S
[3] │ 45-97Q
[4] │ 45-74Q
[5] │ 45-72T
[6] │ <55-10T>
[7] │ <55-48T>
[8] │ 55-69Q
[9] │ <55+178T>
[10] │ 173+8W
[11] │ 235+9W
[12] │ 125+1W
[13] │ 274+2Q
[14] │ 274+5Q
[15] │ 751+9Q
[16] │ 274+1W
[17] │ 274+4W
[18] │ 274+5Q
[19] │ 751+6Q
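As a minimal sketch of why the dot is left unescaped in this pattern, the lines below compare “.” (any character) with “\\.” (a literal dot) using base R’s grepl(). The vector codes is hypothetical and only serves as an illustration.

```r
# Hypothetical product codes, for illustration only
codes <- c("55-10T", "55.10T", "55T", "155-10T")

# "." matches any character, "+" asks for one or more of them
any_char <- grepl("^55.+T$", codes)

# "\\." matches a literal dot only
literal_dot <- grepl("^55\\.10T$", codes)

any_char     # TRUE TRUE FALSE FALSE
literal_dot  # FALSE TRUE FALSE FALSE
```

Note that “55T” is not matched by the first pattern: “.+” requires at least one character between “55” and “T”.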
Using the following vector, you will have to write regular expressions to extract elements of that vector. You can use stringr::str_view(vec, pattern) to see if your regular expression is successful in matching the required outcome. In the folded code, this function is included to show the pattern matches. The folded code also assigns the patterns to pat to use in the function calls.
vec <- c("+32 123 456789", "0032 123 456798", "+32 012345679", "rqx_47-87+5", "rqx_47-87+6", "rqx_47-86+5", "rpts_47-86+5", "usd 25", "eur 36")
Use grep() to extract all locations where you can find a cell phone number. Such a number starts with +32 or 0032, followed by a space, 3 digits, a space and 6 digits. Some people forget to include that second space and add 9 digits after 32. In vec, the first three elements are correct numbers, the others aren’t.
pat <- ".+32\\s[0-9]{3}\\s?[0-9]{6}"
stringr::str_view(vec, pat, match = NA)[1] │ <+32 123 456789>
[2] │ <0032 123 456798>
[3] │ <+32 012345679>
[4] │ rqx_47-87+5
[5] │ rqx_47-87+6
[6] │ rqx_47-86+5
[7] │ rpts_47-86+5
[8] │ usd 25
[9] │ eur 36
grep(pat, vec, value = TRUE)[1] "+32 123 456789" "0032 123 456798" "+32 012345679"
Extract the phone numbers from vec and store the result in vec_phone:
vec_phone <- vec[grepl(pat, vec)]
vec_phone[1] "+32 123 456789" "0032 123 456798" "+32 012345679"
Extract the elements of vec that include a currency. Write your code in such a way that “yen”, “gbp” and “sek” would also be extracted if included.
pat <- "usd|eur|yen|gbp|sek"
stringr::str_view(vec, pat, match = NA)[1] │ +32 123 456789
[2] │ 0032 123 456798
[3] │ +32 012345679
[4] │ rqx_47-87+5
[5] │ rqx_47-87+6
[6] │ rqx_47-86+5
[7] │ rpts_47-86+5
[8] │ <usd> 25
[9] │ <eur> 36
grep(pat, vec, value = TRUE)[1] "usd 25" "eur 36"
grep(pat, vec, value = TRUE) |> stringr::str_split(pattern = " ", simplify = TRUE) [,1] [,2]
[1,] "usd" "25"
[2,] "eur" "36"
Here, you have a matrix. You can extract the values using matrix subsetting operators. These will be introduced in this chapter.
Use vec and extract all values that include “47” after the initial letters and end with “+5”. Use str_extract_all() and simplify the results. Write your code using the pipe operator.
pat <- "([a-z]+)?_47.+\\+5"
stringr::str_view(vec, pat, match = NA)[1] │ +32 123 456789
[2] │ 0032 123 456798
[3] │ +32 012345679
[4] │ <rqx_47-87+5>
[5] │ rqx_47-87+6
[6] │ <rqx_47-86+5>
[7] │ <rpts_47-86+5>
[8] │ usd 25
[9] │ eur 36
vec |> stringr::str_extract_all(pat, simplify = TRUE) [,1]
[1,] ""
[2,] ""
[3,] ""
[4,] "rqx_47-87+5"
[5,] ""
[6,] "rqx_47-86+5"
[7,] "rpts_47-86+5"
[8,] ""
[9,] ""
Using the following paragraph from a Reuters article:
reuters <- "The pound headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, sterling rose ahead of U.S. jobs data. The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. Against the pound, it was set for a weekly gain of 1.5%, the most since January 2023. It was last up 0.4% at 84.03 pence. The pound was up 0.4% against the dollar at $1.292."
Check whether the paragraph mentions the pound or sterling. Use str_detect() to do so.
pat <- "pound|Pound|sterling|Sterling"
stringr::str_view(reuters, pat, match = NA)[1] │ The <pound> headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, <sterling> rose ahead of U.S. jobs data. The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. Against the <pound>, it was set for a weekly gain of 1.5%, the most since January 2023. It was last up 0.4% at 84.03 pence. The <pound> was up 0.4% against the dollar at $1.292.
stringr::str_detect(reuters, pat)[1] TRUE
stringr::str_locate_all(reuters, pat)[[1]]
start end
[1,] 5 9
[2,] 199 206
[3,] 371 375
[4,] 485 489
stringr::str_count(reuters, pat)[1] 4
stringr::str_split(reuters, stringr::boundary("sentence"))[[1]]
[1] "The pound headed for its worst weekly performance against the euro in over two years on Friday, as a boost to European spending drove a broad rally in the single currency, while against the dollar, sterling rose ahead of U.S. jobs data. "
[2] "The euro has surged across the board this week, logging its best weekly performance against the dollar since March 2009. "
[3] "Against the pound, it was set for a weekly gain of 1.5%, the most since January 2023. "
[4] "It was last up 0.4% at 84.03 pence. "
[5] "The pound was up 0.4% against the dollar at $1.292."
stringr::str_split(reuters, stringr::boundary("sentence"))[[1]][4][1] "It was last up 0.4% at 84.03 pence. "
stringr::str_split(reuters, stringr::boundary("sentence"))[[1]][4] |>
stringr::str_split(stringr::boundary("word"))[[1]]
[1] "It" "was" "last" "up" "0.4" "at" "84.03" "pence"
Create a character variable with “var_1”, … “var_5” that you would use to add names to a vector. Save this vector in vec_names.
vec_names <- paste("var", 1:5, sep = "_")
vec_names[1] "var_1" "var_2" "var_3" "var_4" "var_5"
What would happen if you use collapse = "_" and not sep = "_"?
vec_namesc <- paste("var", 1:5, collapse = "_")
vec_namesc[1] "var 1_var 2_var 3_var 4_var 5"
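To make the difference explicit, here is a small sketch that uses both arguments in one call: sep joins the pieces within each element, while collapse joins the resulting elements into a single string. The name one_string is only used for this illustration.

```r
# sep = "_" glues "var" and the number within each element
vec_names <- paste("var", 1:5, sep = "_")
length(vec_names)   # 5

# adding collapse = ", " then glues the 5 elements into 1 string
one_string <- paste("var", 1:5, sep = "_", collapse = ", ")
length(one_string)  # 1
one_string          # "var_1, var_2, var_3, var_4, var_5"
```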
Factors are a special type of vector and are used to represent categorical variables. Categorical variables can take a limited number of known values (often referred to as levels). Examples of categorical variables include nominal variables and ordinal variables.
Nominal variables have two or more categories, but these have no intrinsic ordering. In other words, you cannot take one value of a nominal variable and say that it is higher, lower, bigger or smaller than another value. Hair color, the name of a city, country or continent, a yes/no reply in a questionnaire or the name of a month are examples of nominal variables. You can order them alphabetically or, for months, as they appear in a year, but the ordering doesn’t affect the way you handle them. In other words, if you would recode city names as numeric variables (1 = Amsterdam, 2 = Brussels, 3 = Copenhagen, …), these numeric values wouldn’t have any meaning.
Ordinal variables differ from nominal variables as they have an intrinsic ordering. Examples include educational experience (elementary school, high school, some college, bachelor’s degree, master’s degree, PhD) or price categories measured as “budget” or “premium”. If you would recode these variables as numeric variables, their order would matter. For instance, you would recode “elementary school” as “1”, “high school” as “2”, “some college” as “3”, “bachelor’s degree” as “4”, “master’s degree” as “5” and “PhD” as “6” or, for price categories, “budget” as “1” and “premium” as “2”. However, these categories are not equally spaced. In other words, although the numeric difference between “high school” and “elementary school” (2 - 1 = 1) equals the numeric difference between “PhD” and “master’s degree” (6 - 5 = 1), the actual difference between these levels of education isn’t the same.
In addition to base R’s factor() function, {forcats}, a package included in the {tidyverse}, includes a lot of functions to manipulate factor variables. As we did with {stringr} and {lubridate} functions, I’ll include forcats:: at the start of a function if that function is part of that package. If forcats:: is not part of the function call, the function is a base R function. Recall that all {stringr} functions start with str_. In a similar way, all {forcats} functions start with fct_ and (most) are followed by a verb.
Suppose that you have a variable that records months:
vec_month1 <- c("Sep", "Aug", "Oct", "Jan", "Nov", "Mar", "Dec", "Apr", "Jun", "May", "Feb", "Jul")
Recall from Chapter 2 that these months don’t sort in a meaningful way:
sort(vec_month1) [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
To fix this, we can create a factor and include a vector of valid levels. These levels are ordered in a meaningful way. For instance:
vec_month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
We can now encode vec_month1 as a factor, including the levels, using factor(x = , levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA). Here, x is the vector with the data you want to encode as a factor. The levels are optional: they are a vector of the unique values that x might take, included as character. R assumes by default that this vector with levels is sorted in increasing order of x. If these levels are not included, R uses sort(unique(x)) to set the levels. The labels argument allows you to add labels to the levels. These labels allow you to include a more descriptive term for every level. This is especially useful if the levels are recorded as numeric. By default, R sets these labels equal to the levels. You can exclude some values. In that case, you include a vector with the values to exclude after exclude =. By default, all unique values in x, except missing values, are treated as a separate level. If your data in x includes missing values, exclude = NULL will treat these missing values as a separate level. By default that level is the last level. The ordered argument defaults to is.ordered(x), which is FALSE unless x is already an ordered factor. If it is set to TRUE, R will treat the factor as an ordinal variable. The last argument, nmax, sets an upper bound on the number of levels, which can be useful if x includes a lot of unique values.
Let’s see what these options do. First, let’s accept all default values:
vec_fac1 <- factor(x = vec_month1)
vec_fac1 [1] Sep Aug Oct Jan Nov Mar Dec Apr Jun May Feb Jul
Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
The output shows vec_month1 first and all levels next. As the command didn’t include levels, R ordered the levels using sort(unique()):
sort(unique(vec_month1)) [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
If you add levels, R will change the order and follow the order in the levels argument.
vec_fac1 <- factor(x = vec_month1, levels = vec_month_levels)
vec_fac1 [1] Sep Aug Oct Jan Nov Mar Dec Apr Jun May Feb Jul
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Now, the months are ordered as they were in vec_month_levels. Note that levels that do not occur in x are kept as (unused) levels, while values of x that do not occur in the levels are encoded as NA.
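The following sketch illustrates how base R’s factor() deals with levels that are not observed in the data. The vector with three observed months is made up for this illustration. It also shows that R stores a factor as integer codes that point to the levels.

```r
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Only three different months are observed (made-up data)
f <- factor(c("Mar", "Jan", "Mar", "Dec"), levels = month_levels)

nlevels(f)             # 12: unused levels are kept
as.integer(f)          # 3 1 3 12: position of each value in the levels
levels(droplevels(f))  # "Jan" "Mar" "Dec": unused levels removed on request
```

If you want to get rid of unused levels, droplevels() does so explicitly.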
Adding labels allows you to add more descriptive terms. For the months, these labels could be the months written in full:
vec_month_labels <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
Using these labels:
vec_fac1 <- factor(x = vec_month1, levels = vec_month_levels, labels = vec_month_labels)
vec_fac1 [1] September August October January November March December
[8] April June May February July
12 Levels: January February March April May June July August ... December
All months are now written in full. Note that using these labels to set the levels is not possible. R tries to match every value in the vector it has to encode as factor with a level in the levels vector. In other words, if R encounters “Jan” in the vector it has to encode, it searches for “Jan” in the levels vector. If that value is missing as a level, it will report NA. In other words, R silently converts any value in the vector to encode that it doesn’t find in the levels vector into NA.
The {forcats} fct(x, levels = NULL, na = character()) function allows you to create a factor, but avoids that unmatched values are silently encoded as NA. The first argument is the vector to encode as factor, the second the vector used for levels (NULL or none by default) and the third optional argument allows you to include the values in x that fct() should treat as NA. Let’s first show how you can use this function:
vec_fac2 <- forcats::fct(x = c("Apr", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
vec_fac2[1] Apr Feb Jan
Levels: Jan Feb Apr
Now let’s add a typo and write “Apr” as “Arp”.
vec_fac3 <- forcats::fct(x = c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
Error in `forcats::fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Arp"
vec_fac3
Error: object 'vec_fac3' not found
Here, R produces an error: ! All values of x must appear in levels or na. Missing level: "Arp". As you can see from the error, it warns that a value in the x vector was not included in the levels vector. It also shows its value: “Arp”. Using base R’s factor() would add an NA without warning:
vec_fac3 <- factor(c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"))
vec_fac3[1] <NA> Feb Jan
Levels: Jan Feb Apr
Here, base R changes “Arp” into “NA”. If the typo is undetected, which is likely in large datasets, this would affect your analysis. In {forcats} you need to include “Arp” in the na = argument if you want to avoid an error. In other words, you have to instruct R to treat “Arp” as a missing value:
vec_fac3 <- forcats::fct(x = c("Arp", "Feb", "Jan"), levels = c("Jan", "Feb", "Apr"), na = c("Arp"))
vec_fac3[1] <NA> Feb Jan
Levels: Jan Feb Apr
You can see from the output that “Arp” is now indeed recorded as a missing value. This is how you instructed R to treat “Arp”. There is a second difference between both functions. Base R’s factor() orders using sort(unique(x)) in case the levels argument is missing. {forcats}’ fct() orders by first appearance. In other words, it treats the order in which the values appear in the character vector as the order of the levels.
Factors can include numeric values. Suppose you have a yes/no reply to a question where “No” is recorded as 0 and “Yes” as 1. You could encode that vector as a factor using:
vec_fac2 <- factor(x = c(1, 1, 1, 0, 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
vec_fac2[1] Yes Yes Yes No No Yes No
Levels: No Yes
Note that {forcats}’ fct() needs a character vector to encode as factor. Including a numeric vector causes an error.
forcats::fct(x = c(1, 1, 1, 0, 0, 1, 0), levels = c(0, 1))Error in `forcats::fct()`:
! `x` must be a character vector, not a double vector.
To create an ordered factor, you need to change ordered = is.ordered(x) to ordered = TRUE. Doing so creates an ordinal factor. Suppose you have income levels from a survey recorded as low = 1, medium = 2 and high = 3. Creating an ordered factor:
vec_ord1 <- factor(c(1, 2, 3, 3, 1, 1, 2, 2, 1), levels = c(1, 2, 3), labels = c("Low income", "Medium income", "High income"), ordered = TRUE)
vec_ord1[1] Low income Medium income High income High income Low income
[6] Low income Medium income Medium income Low income
Levels: Low income < Medium income < High income
Here you see that the output shows the levels as well as their ordering: low income is lower than medium and medium income is lower than high income.
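Because the levels of an ordered factor have a rank, you can compare values with a level. The sketch below recreates a shorter version of the income factor (the name income and its values are only used for this illustration) and uses the comparison operator and max():

```r
income <- factor(c(1, 2, 3, 3, 1), levels = c(1, 2, 3),
                 labels = c("Low income", "Medium income", "High income"),
                 ordered = TRUE)

# Which observations rank below "High income"?
below_high <- income < "High income"
below_high                  # TRUE TRUE FALSE FALSE TRUE

# The highest level that occurs in the data
as.character(max(income))   # "High income"
```

For an unordered factor, these comparisons are not meaningful.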
You can check if a vector is a factor using is.factor() and if it is an ordered factor using is.ordered().
is.factor(vec_ord1)[1] TRUE
is.ordered(vec_ord1)[1] TRUE
You can coerce a vector into a factor or ordered factor using as.factor() or as.ordered(). Using c(1, 2, 1, 3, 2, 1) as an example
as.factor(c(1, 2, 1, 3, 2, 1))[1] 1 2 1 3 2 1
Levels: 1 2 3
as.ordered(c(1, 2, 1, 3, 2, 1))[1] 1 2 1 3 2 1
Levels: 1 < 2 < 3
you can see that both functions create a factor. The levels of the ordered factor are created from sort(unique(x)). In other words, as.ordered() assumes that the sorted values of the vector to encode reflect the correct order.
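This assumption matters for character vectors, where sorting is alphabetical. A small sketch with a made-up vector sizes shows the difference between as.ordered() and factor() with explicit levels:

```r
sizes <- c("small", "large", "medium", "small")

# as.ordered() sorts alphabetically: large < medium < small
wrong <- as.ordered(sizes)
levels(wrong)   # "large" "medium" "small"

# Explicit levels give the intended order: small < medium < large
right <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
levels(right)   # "small" "medium" "large"
```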
In plots, it is often useful to reorder factor levels. For instance, if you would plot the population of cities where cities are encoded as a factor and are alphabetically ordered, that plot would show these cities on the horizontal or vertical axis in that order. To produce a nice plot, it might be more convenient to have these cities ordered in terms of their population. In that way, the smallest city would show up on the left of the horizontal axis and the largest city on the right. To do so, you can use {forcats}’ fct_reorder() function. This function’s first argument, .f, is the factor to reorder. The second argument, .x, is the variable that R needs to use to reorder. The third argument, .fun = median, shows the summary function R uses to reorder. For each factor level, R calculates the value of the function in .fun and uses this value to reorder the levels. Using the default .na_rm = NULL, R removes missing values with a warning. Changing that into TRUE will cause R to remove them without a warning and FALSE preserves the NA’s. By default, R orders ascending. Adding .desc = TRUE changes this default.
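A minimal sketch of fct_reorder(), assuming {forcats} is installed. The cities and population numbers (in thousands) are made up for this illustration:

```r
# Made-up data: three cities and their population (in thousands)
city <- factor(c("Amsterdam", "Brussels", "Copenhagen"))
pop <- c(900, 190, 650)

levels(city)            # alphabetical: "Amsterdam" "Brussels" "Copenhagen"

# Reorder the levels by population, smallest first
by_pop <- forcats::fct_reorder(city, pop)
levels(by_pop)          # "Brussels" "Copenhagen" "Amsterdam"

# Largest first
by_pop_desc <- forcats::fct_reorder(city, pop, .desc = TRUE)
levels(by_pop_desc)     # "Amsterdam" "Copenhagen" "Brussels"
```

Note that only the order of the levels changes. The values in the factor stay where they are.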
{forcats}’ fct_recode() and fct_collapse() allow you to modify the factor levels. fct_recode() allows you to recode factor levels. To do so, you need to use fct_recode(x, "new value" = "old value") where x is the factor and the statement "new value" = "old value" is entered for every old factor level that you need to change. If an “old value” is not included, R assumes that it remains as is. To illustrate, we first create a factor:
vec_fac1 <- factor(x = c(1, 2, 3, 4), levels = c(1, 2, 3, 4), labels = c("small city", "large city", "small town", "large town"))
vec_fac1[1] small city large city small town large town
Levels: small city large city small town large town
Let’s now recode to show levels “city, small”, “city, large”, “town, small” and “town, large”:
forcats::fct_recode(vec_fac1,
"city, small" = "small city",
"city, large" = "large city",
"town, small" = "small town",
"town, large" = "large town")[1] city, small city, large town, small town, large
Levels: city, small city, large town, small town, large
Note that you can use this function to reduce the number of levels. For instance, if you want to drop the difference between “small” and “large” and only keep “city” and “town”, you can recode all “small city” and “large city” to “city”:
forcats::fct_recode(vec_fac1,
"city" = "small city",
"city" = "large city",
"town" = "small town",
"town" = "large town")[1] city city town town
Levels: city town
Note that you will lose the 4 original factor levels if you recode and assign the result to the same factor.
{forcats}’ fct_collapse() function performs a similar task. It allows you to collapse various factor levels. The function’s arguments are similar to those of fct_recode(). For instance, suppose you want to merge “small” and “large” into one level using fct_collapse():
forcats::fct_collapse(vec_fac1,
"city" = c("small city", "large city"),
"town" = c("small town", "large town"))[1] city city town town
Levels: city town
Here, you use "city" = c("small city", "large city") to collapse the levels on the right hand side of the equality sign into the level on the left hand side.
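For comparison, base R can collapse levels as well by assigning a named list to levels(). This is a sketch of an alternative, not the {forcats} approach used above. The object collapsed is only used for this illustration.

```r
vec_fac1 <- factor(c("small city", "large city", "small town", "large town"))

# Work on a copy so the original levels are not lost
collapsed <- vec_fac1
levels(collapsed) <- list(city = c("small city", "large city"),
                          town = c("small town", "large town"))

as.character(collapsed)  # "city" "city" "town" "town"
levels(collapsed)        # "city" "town"
```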
Suppose that you have a variable with the values “Asi”, “Afr”, “Eur”, “Ame”, “Oce”. These values stand for “Asia”, “Africa”, “Europe”, “Americas” and “Oceania”. Create a factor that will show these continents in alphabetical order and add labels. Use cont to store this variable.
cont <- factor(c("Asi", "Afr", "Eur", "Ame", "Oce"), levels= c("Afr", "Ame", "Asi", "Eur", "Oce"), labels = c("Africa", "Americas", "Asia", "Europe", "Oceania"))
cont[1] Asia Africa Europe Americas Oceania
Levels: Africa Americas Asia Europe Oceania
Is cont a factor?
is.factor(cont)
[1] TRUE
Is cont an ordered factor?
is.ordered(cont)[1] FALSE
To measure an individual’s education, the following values are used in your dataset: “some high school”, “high school”, “some college”, “bachelor”, “master”, “PhD”. These values are recorded using numbers: 1 (some high school), 2 (high school), … 6 (PhD). var_school shows such a variable.
var_school <- sample(1:6, 20, replace = TRUE)
Create an ordered factor including labels. Assign this factor to school.
school <- factor(var_school, levels= c(1, 2, 3, 4, 5, 6), labels = c("some high school", "high school", "some college", "bachelor", "master", "PhD"), ordered = TRUE)
school [1] some college bachelor bachelor PhD
[5] some high school bachelor master master
[9] master PhD some high school master
[13] PhD high school master some college
[17] master some high school PhD master
6 Levels: some high school < high school < some college < ... < PhD
Use {forcats} to recode school and reduce the number of levels by merging “high school” and “some college” into “secondary” and merging “bachelor” and “master” into “tertiary”. There are two ways to do this:
forcats::fct_recode(school,
"secondary" = "high school",
"secondary" = "some college",
"tertiary" = "bachelor",
"tertiary" = "master") [1] secondary tertiary tertiary PhD
[5] some high school tertiary tertiary tertiary
[9] tertiary PhD some high school tertiary
[13] PhD secondary tertiary secondary
[17] tertiary some high school PhD tertiary
Levels: some high school < secondary < tertiary < PhD
forcats::fct_collapse(school,
"secondary" = c("high school", "some college"),
"tertiary" = c("bachelor", "master")) [1] secondary tertiary tertiary PhD
[5] some high school tertiary tertiary tertiary
[9] tertiary PhD some high school tertiary
[13] PhD secondary tertiary secondary
[17] tertiary some high school PhD tertiary
Levels: some high school < secondary < tertiary < PhD
Matrices are two-dimensional objects that allow you to store data in rows and columns. Recall that vectors stored data in one row and one or more columns. Like vectors, matrices are homogeneous. In other words, they store numeric or character or boolean or date/time values but not a combination of two or more data types. Most of what we discussed for vectors also applies to matrices. As a matter of fact, you can think of a vector as a special case of a matrix: it is a matrix with one row and one or more columns. However, if you want to use the vector as a matrix, you need to create a matrix with 1 row and n columns.
An “mxn” matrix has m rows and n columns. In general, the value on the ith row and jth column is referred to as matrix-name(i,j). We’ll see in the next section how you subset a matrix. A matrix with the same number of rows as there are columns, i.e. an nxn matrix, is also called a square matrix. Here we will focus on numeric matrices. However, as long as all elements in a matrix are of the same type, a matrix can also include characters, logical values, integers or date/time variables.
We will first show how to create a matrix in general. We then move to a couple of special matrices.
To create a matrix, you use the matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) function. The first argument is optional and allows you to add a vector with data to fill the matrix. The second and third argument, nrow = 1 and ncol = 1, determine the size of the matrix: the desired number of rows and columns. If you add a vector that R needs to use to fill the matrix, byrow = FALSE instructs R to fill the matrix by column. In other words, if you have 4 rows and 5 columns, R first fills all rows of the first column, then all rows of the second, … to end with all 4 rows of the 5th column. Changing this default into TRUE tells R to first fill the rows. In other words, R will now first fill the 5 columns of the first row, then move to the second row and fill all columns in that row, … . The last argument allows you to add names to the row and column dimensions. Let’s use a 2x3 matrix mat_0:
mat_0 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat_0 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
As you can see, R creates a matrix with two rows [1, ] and [2, ] and three columns [,1], [,2], [,3]. R added the values in c() by column (default: byrow = FALSE). It took the first two values in c() (1 and 2) and used these to fill the first column. The next two values, 3 and 4, were added to the second column. The last two values in c() are shown in the last column.
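The column-wise fill can be summarised in one rule: with byrow = FALSE, the value in row i and column j is element (j - 1) * nrow + i of the data vector. The sketch below checks this rule for one cell. The names x, m, i and j are only used for this illustration, and the square brackets anticipate the subsetting operators that are introduced later.

```r
x <- c(1, 2, 3, 4, 5, 6)
m <- matrix(x, nrow = 2, ncol = 3)

i <- 2  # row
j <- 3  # column

m[i, j]                    # 6: the value in row 2, column 3
x[(j - 1) * nrow(m) + i]   # 6: the same value, taken from the data vector
```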
The attributes of mat_0 include the dimensions of the matrix: the number of rows and the number of columns:
attributes(mat_0)
$dim
[1] 2 3
To create mat_0 we included its elements via c(). The argument data can include vectors or functions that return a vector. For instance, let’s use a vector vec_1 with 6 elements to illustrate this. We’ll fill vec_1 with a sequence of 1 to 6 using the shorthand for seq(from = 1, to = 6, by = 1):
vec_1 <- 1:6
We can now create a matrix mat_1 (accepting the default values for byrow and dimnames):
mat_1 <- matrix(vec_1, nrow = 2, ncol = 3)
mat_1 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
As you can see, R filled this matrix in the same way as R did for mat_0. Note that we didn’t have to create a separate vector vec_1. We could have included 1:6 as the first argument.
In both examples, the length of c() or vec_1 was equal to the number of cells in the matrix: the number of cells equals nrow * ncol = 6 and the length of vec_1 or c() was also 6. If that is not the case, R may report a warning. If the number of cells in the matrix is larger than the length of the vector, R will use some or all values more than once. Suppose that you have a vector with 4 elements that needs to fill a matrix with 2 rows and 3 columns:
matrix(1:4, nrow = 2, ncol = 3)
Warning in matrix(1:4, nrow = 2, ncol = 3): data length [4] is not a sub-multiple or multiple of the number of columns [3]
[,1] [,2] [,3]
[1,] 1 3 1
[2,] 2 4 2
R will use the first two elements of the vector two times. After having used all 4 elements of 1:4 to fill the first two columns, R uses the same vector again to fill the other cells. In this example, R used 1 and 2 of the sequence 1:4 twice. Note that R shows a warning that the dimensions of the vector and matrix didn’t fit.
If the length of the vector is longer than the number of cells in the matrix,
matrix(1:9, 2, 3)
Warning in matrix(1:9, 2, 3): data length [9] is not a sub-multiple or multiple of the number of rows [2]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
R uses only the first nrow * ncol elements of the vector. Here R used the first 6 elements of the vector 1:9 to fill the matrix, and dropped the others. Again, R shows a warning message that the dimensions didn’t fit.
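Be aware that R does not always warn. If the number of cells is an exact multiple of the length of the vector, R recycles the values silently. The sketch below shows this case and adds a simple check you could run before filling a matrix. The names fill and n_cells are only used for this illustration.

```r
# 6 cells, 3 values: 6 is a multiple of 3, so R recycles without a warning
m <- matrix(1:3, nrow = 2, ncol = 3)
m
# the columns are (1, 2), (3, 1) and (2, 3)

# A defensive check: does the data fit the matrix exactly?
fill <- 1:4
n_cells <- 2 * 3
fits <- length(fill) == n_cells
fits   # FALSE
```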
You can also create a matrix by assigning dimensions to a vector with the dim() function. The next example shows how this works:
vec_1 <- 1:6
dim(vec_1) <- c(2, 3)
vec_1 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
You can use all other functions that generate a vector to fill a matrix. Examples include:
matrix(rnorm(6), nrow = 2, ncol = 3) [,1] [,2] [,3]
[1,] 0.02905913 1.288859 -0.1529046
[2,] 0.46637676 -0.674857 0.2705640
matrix(letters[1:6], nrow = 2, ncol = 3) [,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
matrix(sample(1:1000, 6), nrow = 2, ncol = 3) [,1] [,2] [,3]
[1,] 817 426 412
[2,] 573 530 550
or, using the intersection of 1:12 and 7:18:
matrix(base::intersect(1:12, 7:18), nrow = 2, ncol = 3)
 [,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Let’s now see what byrow = TRUE changes to the outcome of matrix():
mat_1 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
mat_1 [,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Recall that with the default for byrow, R first filled the first column, then the second and then the third. As you can see from the output now, R now filled the first row first putting the first value of vec_1 in the first column, the second value in the second and the third value in the third column. As all columns in the first row were filled, R moved to the second row and used the fourth value of vec_1 to fill the first column on the second row, the fifth value to fill the second column on the second row and the sixth value in the third column on the second row.
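Another way to look at byrow = TRUE: filling a 2x3 matrix by row gives the same result as filling a 3x2 matrix by column and transposing it with t(). A short sketch (the names by_row and by_col_t are only used for this illustration):

```r
# Fill a 2x3 matrix row by row
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Fill a 3x2 matrix column by column and transpose it
by_col_t <- t(matrix(1:6, nrow = 3, ncol = 2))

dim(by_col_t)             # 2 3
all(by_row == by_col_t)   # TRUE
```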
The last option allows you to specify row and column names. To do so, you need to collect them in a list. We’ll see shortly that lists are yet another data structure in R. Note that there are other ways to set column and row names.
mat_1 <- matrix(vec_1, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1", "row2"), c("var1", "var2", "var3")))
mat_1 var1 var2 var3
row1 1 2 3
row2 4 5 6
The output now shows the row and column names. These names are added to the attributes of the matrix as dimnames[[1]] for the rows and dimnames[[2]] for the columns:
attributes(mat_1)
$dim
[1] 2 3
$dimnames
$dimnames[[1]]
[1] "row1" "row2"
$dimnames[[2]]
[1] "var1" "var2" "var3"
An alternative way to add row and column names is to use the functions rownames() and colnames(). Let’s first recreate mat_1 without names:
mat_1 <- matrix(1:6, nrow = 2, ncol = 3)
You can use colnames() in two ways. The first way is to assign a vector with column names. Suppose you want to add the following names to the columns of mat_1: c("var1", "var2", "var3"). To do so, use
colnames(mat_1) <- c("var1", "var2", "var3")
mat_1 var1 var2 var3
[1,] 1 3 5
[2,] 2 4 6
To add rownames, you can use rownames(). This function requires a vector with the names: c("row1", "row2"). To add these names:
rownames(mat_1) <- c("row1", "row2")
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
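If you only want to use names for columns and rows that include a prefix and a row or column number, e.g. “col1” or “row1”, then there is a shortcut where you don’t have to type all row or column names. Using colnames(x, do.NULL = TRUE, prefix = "col") you can specify the matrix in x and the prefix in prefix = "col". The argument do.NULL is by default TRUE. With this default, R returns NULL if the matrix has no column names. Changing that into FALSE tells R to create names from the prefix and the column number. Note that the prefix is only used if the matrix has no column names yet. As mat_1 already has column names, the call below returns the existing names and the output doesn’t change.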
colnames(mat_1) <- colnames(mat_1, do.NULL = FALSE, prefix = "var_")
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
You can do the same for the rows and add names using a prefix, e.g. “obs_”, and the row number. Again, as mat_1 already has row names, these are kept:
rownames(mat_1) <- rownames(mat_1, do.NULL = FALSE, prefix = "obs_")
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
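To see the prefix at work, you need a matrix without names. The sketch below uses a fresh matrix m (a name chosen for this illustration) so that do.NULL = FALSE actually creates the names:

```r
# A fresh matrix without row or column names
m <- matrix(1:6, nrow = 2, ncol = 3)
colnames(m)   # NULL

# No names yet, so R builds them from the prefix and the position
colnames(m) <- colnames(m, do.NULL = FALSE, prefix = "var_")
rownames(m) <- rownames(m, do.NULL = FALSE, prefix = "obs_")

colnames(m)   # "var_1" "var_2" "var_3"
rownames(m)   # "obs_1" "obs_2"
```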
Note that there are many ways you can use the character functions to automate the process of naming rows and columns. As an illustration, let’s rewrite
rownames(mat_1) <- c("row1", "row2")
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
using the paste0() function:
rownames(mat_1) <- paste0("row", 1:2)
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
and
colnames(mat_1) <- c("var1", "var2", "var3")
mat_1 var1 var2 var3
row1 1 3 5
row2 2 4 6
using the paste() function:
colnames(mat_1) <- paste(1:3, c("st", "nd", "rd"), sep="")
mat_1 1st 2nd 3rd
row1 1 3 5
row2 2 4 6
If your matrix has column or row names, you can show these using the same colnames() or rownames() function. For instance:
colnames(mat_1)[1] "1st" "2nd" "3rd"
rownames(mat_1)[1] "row1" "row2"
Recall that vectors are data structures with 1 row and one or more columns. If you need to work with matrix algebra and use vectors, it is best to create a vector explicitly as a matrix. To do so, you need a matrix with 1 row and e.g. 3 columns:
mat_vec <- matrix(1:3, 1, 3)
There are a couple of special matrices. Using the matrix() function, we can create an mxn matrix with one constant value.
mat_2 <- matrix(5, nrow = 2, ncol = 3)
mat_2 [,1] [,2] [,3]
[1,] 5 5 5
[2,] 5 5 5
Here there are two special cases: a square matrix filled with ones;
J <- matrix(1, nrow = 3, ncol = 3)
J [,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
and the zero matrix: an mxn matrix with zeros:
zeros <- matrix(0, nrow = 2, ncol = 3)
zeros [,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
A diagonal matrix is a square matrix where all values are equal to zero except those on the diagonal:
diag(c(10, 11, 12), nrow = 3, ncol = 3) [,1] [,2] [,3]
[1,] 10 0 0
[2,] 0 11 0
[3,] 0 0 12
A special case of this diagonal matrix is the identity matrix: a diagonal matrix whose diagonal elements are equal to 1:
ident <- diag(1, nrow = 3, ncol = 3)
ident [,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
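The identity matrix plays the role of the number 1 in matrix multiplication: multiplying a matrix by it leaves the matrix unchanged. A short sketch, using the matrix product operator %*% and a made-up matrix A. Note that diag() with a single number n as its only argument creates an n x n identity matrix, while diag() applied to a matrix extracts its diagonal.

```r
A <- matrix(1:9, nrow = 3, ncol = 3)
I3 <- diag(3)          # 3x3 identity matrix

all(I3 %*% A == A)     # TRUE
all(A %*% I3 == A)     # TRUE

diag(A)                # 1 5 9: the diagonal of A
```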
The last special case is a vector (a 1xn matrix) whose elements are all 1:
vec_ones <- matrix(1, 1, 3)
vec_ones [,1] [,2] [,3]
[1,] 1 1 1
Triangular matrices are square matrices where all elements below the diagonal are 0 (upper triangular) or all elements above the diagonal are 0 (lower triangular). If, in addition to these elements, the elements on the diagonal are also 0, the square matrix is strictly triangular. The functions upper.tri(x, diag = FALSE) and lower.tri(x, diag = FALSE) can be used to create those matrices. These functions return a logical matrix whose elements are TRUE if they are above the diagonal (upper, with diag = FALSE) and FALSE if this is not the case. With diag = TRUE, the logical values on the diagonal will also be TRUE. The interpretation for the lower triangular function is identical, with the exception that TRUE in this case is for elements below, or below and on, the diagonal. To see how these functions work, we’ll use:
mat_1 <- matrix(1:25, 5, 5)
Using upper.tri() as an example to show the logical matrix:
upper.tri(mat_1, diag = FALSE) [,1] [,2] [,3] [,4] [,5]
[1,] FALSE TRUE TRUE TRUE TRUE
[2,] FALSE FALSE TRUE TRUE TRUE
[3,] FALSE FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE FALSE
We can now use this logical matrix to change mat_1 into a lower triangular matrix whose elements on the diagonal differ from 0:
mat_1[upper.tri(mat_1, diag = FALSE)] <- 0
mat_1 [,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 2 7 0 0 0
[3,] 3 8 13 0 0
[4,] 4 9 14 19 0
[5,] 5 10 15 20 25
Note that here, we use the function upper.tri() to create a lower triangular matrix. To create a strictly lower triangular matrix, change the default diag = FALSE into diag = TRUE, so that the diagonal is set to 0 as well:
mat_1 <- matrix(1:25, 5, 5)
mat_1[upper.tri(mat_1, diag = TRUE)] <- 0
mat_1 [,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 2 0 0 0 0
[3,] 3 8 0 0 0
[4,] 4 9 14 0 0
[5,] 5 10 15 20 0
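The same two results can also be reached with lower.tri(): negating its logical matrix selects everything that is not part of the lower triangle. A minimal sketch, in which the names m, m_lower and m_strict are only illustrative and not objects used elsewhere in this chapter:

```r
# Illustrative names; base R only.
m <- matrix(1:25, 5, 5)

# Lower triangular: set everything that is NOT on or below the diagonal to 0.
m_lower <- m
m_lower[!lower.tri(m_lower, diag = TRUE)] <- 0

# Strictly lower triangular: the diagonal is set to 0 as well.
m_strict <- m
m_strict[!lower.tri(m_strict, diag = FALSE)] <- 0

m_lower
m_strict
```

Which of the two functions you use is mostly a matter of readability: pick the one whose name matches the part of the matrix you want to keep or overwrite.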
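For R, a plain vector is not a matrix: a vector has no dim attribute, and its class is its type (for instance “integer”) rather than “matrix”. In the output below, vec_1 reports the same class as mat_1, which indicates that vec_1 already carries dimensions at this point: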
class(vec_1)[1] "matrix" "array"
class(mat_1)[1] "matrix" "array"
The as.matrix(x) function tries to turn the object x into a matrix. Doing so, as.matrix() keeps the dimensions of x. A plain vector of length 6 would become a matrix with 6 rows and 1 column. Because vec_1 already has 2 rows and 3 columns here, the result is a 2x3 matrix:
mat_1 <- as.matrix(vec_1)
mat_1 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
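To see the difference between a plain vector and a matrix, the sketch below starts from a fresh vector without a dim attribute. The names v and m are assumptions made for this illustration; the class shown for a matrix assumes R 4.0 or later.

```r
v <- 1:6                   # plain integer vector, no dim attribute
class(v)                   # "integer"
is.matrix(v)               # FALSE

m <- as.matrix(v)          # a vector without dimensions becomes a column matrix
dim(m)                     # 6 1

dim(v) <- c(2, 3)          # giving v dimensions turns it into a matrix
class(v)                   # "matrix" "array" (R 4.0 or later)
identical(as.matrix(v), v) # TRUE: an existing matrix is returned unchanged
```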
You can change a data frame into a matrix in a similar way. Recall from Chapter 1 that R includes a dataset mtcars. A data frame is a data structure we will discuss shortly:
class(mtcars)[1] "data.frame"
Recall that this data frame had 32 observations for 11 variables. This data frame includes variable names and identifies every observation:
head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
With as.matrix() you can change this data frame into a matrix:
mat_mtcars <- as.matrix(mtcars)
This matrix has column and row names.
colnames(mat_mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
rownames(mat_mtcars) [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
Recall that a matrix is homogeneous: all elements must be of the same type. Data frames are often heterogeneous: they include numeric, character, date/time, boolean or factor variables. Using the data.matrix() function, R will change such a data frame into a numeric matrix by converting all variables to numeric first. For instance, suppose that you have a data frame df:
df <- data.frame(A = 1:3, B = letters[1:3], C = seq.Date(as.Date("2025-03-25"), by = "day", length.out = 3))
df A B C
1 1 a 2025-03-25
2 2 b 2025-03-26
3 3 c 2025-03-27
If you use as.matrix(), R converts this data frame into a character matrix:
mat_df <- as.matrix(df)
typeof(mat_df)[1] "character"
Using data.matrix() avoids this:
mat_df <- data.matrix(df)
mat_df A B C
[1,] 1 1 20172
[2,] 2 2 20173
[3,] 3 3 20174
As you can see, the second column, B, has been changed to numeric. R changed the value of “a” into 1, “b” into 2, … . Here, all unique values are given a different numeric value. In addition, R used the fact that dates are numeric to change the date into numeric.
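If the dates are needed again later, the numeric column can be converted back. A small sketch, rebuilding the same df as above so that it runs on its own; the name dates_back is illustrative:

```r
df <- data.frame(A = 1:3, B = letters[1:3],
                 C = seq.Date(as.Date("2025-03-25"), by = "day", length.out = 3))
mat_df <- data.matrix(df)

# Dates are stored as the number of days since 1970-01-01,
# so using that origin turns the numbers back into dates.
dates_back <- as.Date(mat_df[, "C"], origin = "1970-01-01")
dates_back
```

The letters in column B cannot be recovered in the same way: once they are replaced by integer codes, the matrix no longer stores which letter belonged to which code.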
Here we first created the matrix using matrix(). However, a matrix may also be the result of other operations within your code. To check if an object is a matrix, you can use the is.matrix() function.
is.matrix(mat_1)[1] TRUE
If an object is a matrix, this function returns TRUE. If this is not the case, the function returns FALSE. If an object is a matrix, you can check its type using typeof():
typeof(mat_1)[1] "integer"
Here, mat_1 is an integer matrix. In other words, its values are of type “integer”. Recall that this matrix was created from the sequence 1:6, which is an integer sequence. The type of a matrix is determined in a similar way as that of a vector. To illustrate, let’s define 3 matrices
mat_n <- matrix(rnorm(6), 2, 3)
mat_c <- matrix(letters[1:6], 2, 3)
mat_d <- matrix(seq(as.Date("2025-03-25"), by = "day", length.out = 6), 2, 3)
and check their type:
typeof(mat_n)[1] "double"
typeof(mat_c)[1] "character"
typeof(mat_d)[1] "double"
As you can see, R stores the dates in mat_d as numeric values; the class of mat_d is “matrix” “array”. Recall that dates are stored as numbers. In other words, you can turn these numbers into dates using one of the functions we have covered, e.g. as.Date() or {lubridate}’s ymd().
It is important to stress that matrices are homogeneous in the sense that they can only include values of one type. For instance, the following line changes the element on the second row and second column of mat_n into a character value.
mat_n[2,2] <- "10000"
mat_n [,1] [,2] [,3]
[1,] "0.982499218575708" "0.42216077988595" "0.22435607224835"
[2,] "-1.01112994823057" "10000" "0.125027643014595"
The output suggests that R coerced all other elements into character values. Indeed, the type of this matrix is
typeof(mat_n)[1] "character"
now a character matrix. We have seen similar behavior with vectors.
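The coercion follows the same order as for vectors: logical, integer, double, character. The sketch below walks a small matrix through that order by assigning one element at a time; the names m and types are illustrative.

```r
m <- matrix(c(TRUE, FALSE, TRUE, FALSE), 2, 2)
types <- typeof(m)                 # "logical"

m[1, 1] <- 5L                      # an integer turns the whole matrix into integer
types <- c(types, typeof(m))

m[1, 2] <- 2.5                     # a double turns integer into double
types <- c(types, typeof(m))

m[2, 2] <- "a"                     # a string turns everything into character
types <- c(types, typeof(m))

types
m
```

Note that the coercion only goes in one direction: overwriting the string with a number afterwards does not turn the matrix back into a numeric one.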
A matrix has some attributes that you will often use in code. To check the attributes of an object, you can use R’s attributes() function. To illustrate this function, we’ll use
mat_1 <- matrix(vec_1, nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("row1", "row2"), c("var1", "var2", "var3")))
mat_1 var1 var2 var3
row1 1 2 3
row2 4 5 6
The function shows the attributes of the object:
att_mat1 <- attributes(mat_1)
att_mat1$dim
[1] 2 3
$dimnames
$dimnames[[1]]
[1] "row1" "row2"
$dimnames[[2]]
[1] "var1" "var2" "var3"
Here, you see the various attributes: the dimension $dim, the row names $dimnames[[1]] and the column names $dimnames[[2]]. There are multiple ways to access these attributes. For instance, the dimension of the matrix includes the number of rows (2) and the number of columns (3). To extract these dimensions you can use dim(). This function shows the number of rows and columns in mat_1.
dim(mat_1)[1] 2 3
You can use these to extract the number of rows and columns. To do so, you assign the result of this function to an object.
dim_mat1 <- dim(mat_1)
dim_mat1[1] 2 3
You can now subset this result:
nobs <- dim_mat1[1]
nvar <- dim_mat1[2]
These values now store the number of rows (nobs) and the number of columns (nvar).
If you are only interested in the number of rows or number of columns, you can extract these using nrow() or ncol():
nrow(mat_1)[1] 2
ncol(mat_1)[1] 3
Note that in many cases, the number of rows will be equal to the number of observations in your dataset while the number of columns is equal to the number of variables.
To see the total number of values in the matrix, or the product of the number of rows and the number of columns, you can use length():
length(mat_1) [1] 6
nrow(mat_1) * ncol(mat_1)[1] 6
If you need the column or row names, you can use colnames() or rownames():
colnames(mat_1)[1] "var1" "var2" "var3"
rownames(mat_1)[1] "row1" "row2"
If you store these names, you can use them in your code.
Create a 3x3 matrix, mat_0 using a sequence from 21:29
mat_0 <- matrix(21:29, 3, 3)
mat_0 [,1] [,2] [,3]
[1,] 21 24 27
[2,] 22 25 28
[3,] 23 26 29
Using the same values, fill this matrix by row:
mat_0 <- matrix(21:29, 3, 3, byrow = TRUE)
mat_0 [,1] [,2] [,3]
[1,] 21 22 23
[2,] 24 25 26
[3,] 27 28 29
Store the numbers 21-29 in a vector mat_0 and create a matrix using the dim() function
mat_0 <- 21:29
dim(mat_0) <- c(3, 3)
mat_0 [,1] [,2] [,3]
[1,] 21 24 27
[2,] 22 25 28
[3,] 23 26 29
What happens if you use 1:4 to create a 2x3 matrix mat_0, filled by column? Predict the value in mat_0[2, 3].
mat_0 <- matrix(1:4, 2, 3)Warning in matrix(1:4, 2, 3): data length [4] is not a sub-multiple or multiple
of the number of columns [3]
mat_0[2, 3][1] 2
What is the value in mat_0[2, 3] if you fill the matrix with 1:9?
mat_0 <- matrix(1:9, 2, 3)Warning in matrix(1:9, 2, 3): data length [9] is not a sub-multiple or multiple
of the number of rows [2]
mat_0[2 ,3][1] 6
Create a named 3x3 matrix mat_0 with elements 21-29, row names “obs_1”, “obs_2”, … and column names “var_1”, “var_2”, …
mat_0 <- matrix(21:29, 3, 3, dimnames = list(c("obs_1", "obs_2", "obs_3"), c("var_1", "var_2", "var_3")))
mat_0 var_1 var_2 var_3
obs_1 21 24 27
obs_2 22 25 28
obs_3 23 26 29
Here, you had to write down all names. First recreate mat_0 without names and then use the rownames() and colnames() functions to set the names. Using these functions, try to avoid writing all names. There are two ways to do so.
mat_0 <- matrix(21:29, 3, 3)
rownames(mat_0) <- paste("obs", 1:3, sep = "_")
colnames(mat_0) <- paste("var", 1:3, sep = "_")
mat_0 var_1 var_2 var_3
obs_1 21 24 27
obs_2 22 25 28
obs_3 23 26 29
mat_0 <- matrix(21:29, 3, 3)
rownames(mat_0) <- rownames(mat_0, do.NULL = FALSE, prefix = "obs_")
colnames(mat_0) <- colnames(mat_0, do.NULL = FALSE, prefix = "var_")
mat_0 var_1 var_2 var_3
obs_1 21 24 27
obs_2 22 25 28
obs_3 23 26 29
Create a 4x4 identity matrix ident
ident <- diag(1, 4, 4)
ident [,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
Determine the number of rows and columns for this matrix:
mat_0 <- matrix(rnorm(1000), 500, 2)
colnames(mat_0) <- c("var_1", "var_2")
rownames(mat_0) <- paste("obs", 1:500, sep = "_")
nrow(mat_0)[1] 500
ncol(mat_0)[1] 2
attributes(mat_0)$dim[1][1] 500
attributes(mat_0)$dim[2][1] 2
Determine the type of mat_0:
typeof(mat_0)[1] "double"
Fill a 3x3 matrix, mat_1, with the first 9 letters of the alphabet, lowercase
mat_1 <- matrix(letters[1:9], 3, 3)
mat_1 [,1] [,2] [,3]
[1,] "a" "d" "g"
[2,] "b" "e" "h"
[3,] "c" "f" "i"
Fill a 3x3 matrix, mat_2 with a sequence of dates, starting 2025-04-01 and ending 2025-04-09.
mat_2 <- matrix(seq.Date(from = as.Date("2025-04-01"), to = as.Date("2025-04-09"), by = "days"), 3, 3)
mat_2 [,1] [,2] [,3]
[1,] 20179 20182 20185
[2,] 20180 20183 20186
[3,] 20181 20184 20187
Create a 3x3 boolean matrix, mat_3, using random sample from TRUE and FALSE
mat_3 <- matrix(sample(c(TRUE, FALSE), 9, replace = TRUE), 3, 3)
mat_3 [,1] [,2] [,3]
[1,] FALSE TRUE FALSE
[2,] TRUE FALSE FALSE
[3,] FALSE TRUE FALSE
Subsetting a matrix uses an approach which is very similar to the one used for a vector. However, with a matrix you have both rows and columns. This allows you to subset individual elements, all rows of one or multiple columns, all columns of one or multiple rows, or a range of elements spread over some columns and some rows. Matrix mat will be used to illustrate these approaches:
mat <- matrix(c(11, 21, 31, 41, 12, 22, 32, 42, 13, 23, 33, 34, 41, 42, 43, 44), nrow = 4, ncol = 4)
mat [,1] [,2] [,3] [,4]
[1,] 11 12 13 41
[2,] 21 22 23 42
[3,] 31 32 33 43
[4,] 41 42 34 44
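As you can see, most elements of the matrix are equal to their row-column indices: 23, for instance, sits on row 2 and column 3. The exceptions are the element on row 4 and column 3 (34) and the fourth column, which holds the values 41 to 44.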
To subset an individual element, you can use mat[m, n] with m the row index and n the column index. For instance, extracting the element in the second row and the third column:
mat[2, 3][1] 23
If you assign the outcome to a new variable, you can use it in your code.
You can extract an entire column using mat[, n]. For instance, extracting the 4th column of mat:
mat[, 4][1] 41 42 43 44
Subsetting a specific row uses a similar approach. To subset row m, you use mat[m, ]. For instance, subsetting the 3rd row of mat:
mat[3, ][1] 31 32 33 43
Note that R shows the simplest possible data structure. Subsetting a row or column results in a numeric vector. To see this, let’s use is.vector() and ask for the class of mat[3, ]:
class(mat[3, ])[1] "numeric"
is.vector(mat[3, ])[1] TRUE
To preserve the structure, you need to add drop = FALSE within the subsetting operations. For instance,
mat[3, , drop = FALSE] [,1] [,2] [,3] [,4]
[1,] 31 32 33 43
preserves the structure of the matrix. You can see this from the result, which is now shown as a matrix, as well as from the logical checks
is.vector(mat[3, , drop = FALSE])[1] FALSE
is.matrix(mat[3, , drop = FALSE])[1] TRUE
In programming, adding drop = FALSE is usually a good idea as it preserves the data structure. With vectors, the subsetting operator [] preserved the structure of the vector while [[ ]] acted as the simplifying operator. With matrices, [] acts as the simplifying operator. To preserve the structure, you need to add drop = FALSE or drop = F.
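The sketch below illustrates why drop = FALSE matters inside functions. The helper count_cols() is hypothetical and only written for this illustration: it counts the columns of a selection and keeps working when only one column is selected.

```r
# Hypothetical helper: counts how many columns a selection contains.
count_cols <- function(m, cols) {
  ncol(m[, cols, drop = FALSE])  # drop = FALSE keeps a matrix, even for one column
}

m <- matrix(1:12, 3, 4)
count_cols(m, 2:3)  # 2
count_cols(m, 2)    # 1

# Without drop = FALSE a single column is simplified to a vector,
# and ncol() of a vector is NULL.
ncol(m[, 2])        # NULL
```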
You can subset multiple columns or rows. Suppose you need columns n to k of mat. You can subset these using mat[, n:k]. For instance, subsetting the 2nd to 4th column:
mat[, 2:4] [,1] [,2] [,3]
[1,] 12 13 41
[2,] 22 23 42
[3,] 32 33 43
[4,] 42 34 44
Note that in this case, the structure is preserved: the simplest possible data structure to show the result of the subsetting operation is a matrix.
Similarly, subsetting rows m to l is done using mat[m:l, ]. For instance, with m = 2 and l = 4 you subset the 2nd to 4th row:
mat[2:4, ] [,1] [,2] [,3] [,4]
[1,] 21 22 23 42
[2,] 31 32 33 43
[3,] 41 42 34 44
mat[m:l, n:k] subsets a range: the elements on row m to l and in columns n to k. For instance, if you need the elements in rows 2 to 4 and in columns 1 to 3:
mat[2:4, 1:3] [,1] [,2] [,3]
[1,] 21 22 23
[2,] 31 32 33
[3,] 41 42 34
If you need specific rows or columns that are not in a range, you can collect them in a vector using c(m, l, ...). For instance
mat[c(1, 3), c(2, 4)] [,1] [,2]
[1,] 12 41
[2,] 32 43
mat[c(1, 3), ] [,1] [,2] [,3] [,4]
[1,] 11 12 13 41
[2,] 31 32 33 43
mat[, c(2, 4)] [,1] [,2]
[1,] 12 41
[2,] 22 42
[3,] 32 43
[4,] 42 44
Using negative index numbers, you tell R that you don’t want to extract those rows or columns. For instance, to show all elements in mat except those in the first row and first column, mat[-1, -1] shows:
mat[-1, -1] [,1] [,2] [,3]
[1,] 22 23 42
[2,] 32 33 43
[3,] 42 34 44
You can use negative indices to exclude one or more rows or columns or ranges:
mat[, -3:-4] [,1] [,2]
[1,] 11 12
[2,] 21 22
[3,] 31 32
[4,] 41 42
mat[-1:-3, ][1] 41 42 34 44
Note that in this case, R simplifies the output to a vector (mat has 4 rows and you extract all except the first three). This is an example where subsetting changes the data structure because only one row remains. To avoid that, you can preserve the structure with drop = F:
mat[-1:-3, , drop = F] [,1] [,2] [,3] [,4]
[1,] 41 42 34 44
mat[-1:-2, -1:-2] [,1] [,2]
[1,] 33 43
[2,] 34 44
Note that you can exclude multiple columns or rows that are not in a range using -c(k, l), e.g. extracting all columns except 1 and 3:
mat[, -c(1, 3)] [,1] [,2]
[1,] 12 41
[2,] 22 42
[3,] 32 43
[4,] 42 44
With named matrices, you can also refer to the names of the columns and rows. Let’s add row and column names to mat:
colnames(mat) <- colnames(mat, do.NULL = FALSE, prefix = "var_")
rownames(mat) <- rownames(mat, do.NULL = FALSE, prefix = "row_")
mat var_1 var_2 var_3 var_4
row_1 11 12 13 41
row_2 21 22 23 42
row_3 31 32 33 43
row_4 41 42 34 44
You can now subset this matrix using mat["rowname", "columnname"]. For instance, extracting the element on row 2 and column 3:
mat["row_2", "var_3"][1] 23
Subsetting all elements in column var_3:
mat[, "var_3"]row_1 row_2 row_3 row_4
13 23 33 34
Note that R retains the row names in this case. However, you lose the structure of the matrix. To avoid this, add drop = F:
mat[, "var_3", drop = F] var_3
row_1 13
row_2 23
row_3 33
row_4 34
R also shows the names if you subset a named matrix using indices:
mat[, 3, drop = F] var_3
row_1 13
row_2 23
row_3 33
row_4 34
To extract all elements in row 2 and keep the structure:
mat["row_2", , drop = F] var_1 var_2 var_3 var_4
row_2 21 22 23 42
You can also collect the names in a vector and subset multiple rows:
mat[c("row_1", "row_3"), ] var_1 var_2 var_3 var_4
row_1 11 12 13 41
row_3 31 32 33 43
or multiple columns:
mat[, c("var_1", "var_3")] var_1 var_3
row_1 11 13
row_2 21 23
row_3 31 33
row_4 41 34
or both:
mat[c("row_1", "row_3"), c("var_1", "var_3")] var_1 var_3
row_1 11 13
row_3 31 33
Recall that you can subset a vector using a logical vector. For a matrix, this is also true. However, in this case, the result is not a matrix but a vector. This vector includes all elements for which the condition returned TRUE. Let’s create a random logical matrix:
cond = matrix(sample(c(TRUE, FALSE), 16, TRUE), nrow = 4, ncol = 4)
cond [,1] [,2] [,3] [,4]
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE TRUE TRUE FALSE
[4,] TRUE TRUE FALSE FALSE
Here, we have a matrix whose elements are either TRUE or FALSE. We can use this matrix to extract the values in mat that are in the same position as the value TRUE in the matrix cond. To do so, we can use:
mat[cond][1] 21 41 12 32 42 33
Within the [] you can include various conditions. For instance, to extract all elements in mat larger than 25, you can use
mat[mat > 25] [1] 31 41 32 42 33 34 41 42 43 44
You can further refine the condition and apply it to only one column or one row. For instance, to extract all rows whose value in the first column is larger than 25, you can define this condition:
cond <- mat[, 1] > 25
condrow_1 row_2 row_3 row_4
FALSE FALSE TRUE TRUE
If you include this condition in the subsetting operator for the rows, you’ll see all columns for the rows whose value in the first column is larger than 25:
mat[cond, ] var_1 var_2 var_3 var_4
row_3 31 32 33 43
row_4 41 42 34 44
Collecting values in a vector allows you to verify if they are also elements in the matrix. For instance, extracting the elements in mat that are equal to 12, 22, 33, 44 or 55 is done using
mat[mat %in% c(12, 22, 33, 44, 55)][1] 12 22 33 44
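A logical subset returns the values but not where they sit in the matrix. If the positions matter, base R's which() with arr.ind = TRUE returns the row and column index of every TRUE. A minimal sketch on a small matrix m (an illustrative name, not the mat used above):

```r
m <- matrix(c(11, 21, 31, 12, 22, 32), nrow = 3, ncol = 2)

# One row per match, with the columns "row" and "col".
pos <- which(m > 25, arr.ind = TRUE)
pos

# Indexing with this two-column matrix extracts exactly those elements.
m[pos]  # 31 32
```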
Subsetting using logical conditions also allows you to subset a named matrix using regular expressions. Recall that the grepl() function outputs a logical vector. If a matrix has column or row names, you can use these in grepl() to extract observations (rows) or variables (columns) that match a regular expression. Suppose that you want to extract all observations on row_1, row_2 and row_3. Here, a simple regular expression would be “_[0-3]”. This regular expression matches all row names that include “_” followed by one digit equal to 0, 1, 2 or 3. grepl() needs to find matches in the row names of mat:
grepl(pattern = "_[0-3]", x = rownames(mat))[1] TRUE TRUE TRUE FALSE
We can now use this expression to extract the observations. To do so, you can create a vector cond to store the result of grepl(), which you can then use to subset:
cond <- grepl(pattern = "_[0-3]", x = rownames(mat))
mat[cond, ] var_1 var_2 var_3 var_4
row_1 11 12 13 41
row_2 21 22 23 42
row_3 31 32 33 43
As an alternative, you can use grepl() directly in the subsetting operation:
mat[grepl(pattern = "_[0-3]", x = rownames(mat)), ] var_1 var_2 var_3 var_4
row_1 11 12 13 41
row_2 21 22 23 42
row_3 31 32 33 43
This last method is less likely to result in code that is easy to read. In other words, if the pattern is complex, it is in general a good idea to use the first method.
You can do the same with column names. For instance, extracting all variables var_2, var_3 and var_4 can be done through:
mat[, grepl(pattern = "_[2-4]", x = colnames(mat))] var_2 var_3 var_4
row_1 12 13 41
row_2 22 23 42
row_3 32 33 43
row_4 42 34 44
Combining both subsets both rows as well as columns:
mat[grepl(pattern = "_[0-3]", x = rownames(mat)), grepl(pattern = "_[2-4]", x = colnames(mat))] var_2 var_3 var_4
row_1 12 13 41
row_2 22 23 42
row_3 32 33 43
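These patterns work here because mat has only four rows and columns. With ten or more rows, a pattern such as "_[0-3]" also matches names like row_10, since it only looks at the first digit after the underscore. The sketch below, using an illustrative 12-row matrix m, shows the effect and one way to avoid it by anchoring the pattern with $:

```r
m <- matrix(1:24, nrow = 12, ncol = 2)
rownames(m) <- paste("row", 1:12, sep = "_")

# "_[0-3]" only checks the first digit after the underscore,
# so row_10, row_11 and row_12 match as well.
loose <- rownames(m)[grepl("_[0-3]", rownames(m))]
loose

# The anchor $ requires the name to end right after that single digit.
strict <- rownames(m)[grepl("_[1-3]$", rownames(m))]
strict
```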
The subset(x, subset, select, drop = FALSE, ...) function allows you to extract the columns defined in select from the matrix x, using a logical index defined in subset. Using this function, you can subset rows and select which columns R needs to return. For instance, to select the values in columns 1 and 4 of mat for the rows where the value in column 2 is larger than 20, i.e. mat[, 2] > 20, you would use:
subset(mat, subset = mat[, 2] > 20, select = c(1, 4)) var_1 var_4
row_2 21 42
row_3 31 43
row_4 41 44
In the select argument, you can use the usual subsetting methods:
subset(mat, subset = mat[, 2] > 20, select = 2:4) var_2 var_3 var_4
row_2 22 23 42
row_3 32 33 43
row_4 42 34 44
subset(mat, subset = mat[, 2] > 20, select = -4) var_1 var_2 var_3
row_2 21 22 23
row_3 31 32 33
row_4 41 42 34
Using the row names of the matrix, you can also use grepl(). For instance, selecting columns 1 and 4 and only rows whose name includes “3” or “4” uses:
cond <- grepl(pattern = "row_[3-4]", rownames(mat))
subset(mat, cond, c(1, 4)) var_1 var_4
row_3 31 43
row_4 41 44
If you don’t use select =, R returns all columns by default. If you omit the subset argument, R returns all rows of the selected columns.
If you have a square matrix, you can extract the diagonal elements using diag(x). Extracting the diagonal elements from mat:
diag(mat, names = TRUE)[1] 11 22 33 44
To extract the upper or lower triangular part of a square matrix, there are two functions: upper.tri(x, diag = FALSE) and lower.tri(x, diag = FALSE). The first marks the upper triangular part, the second the lower triangular part. Both take the square matrix as the first argument. The second argument determines if the diagonal is included or not (default). The outcome is a logical matrix that can be used to subset the matrix.
uptri <- upper.tri(mat, diag = FALSE)
uptri [,1] [,2] [,3] [,4]
[1,] FALSE TRUE TRUE TRUE
[2,] FALSE FALSE TRUE TRUE
[3,] FALSE FALSE FALSE TRUE
[4,] FALSE FALSE FALSE FALSE
Applying this logical matrix to mat:
mat[uptri][1] 12 13 23 41 42 43
Extracting the lower triangular part can be done in a similar way. If we add the diagonal,
lotri <- lower.tri(mat, diag = TRUE)
mat[lotri] [1] 11 21 31 41 22 32 42 33 34 44
Create 3x3 matrix, mat_0 as a sequence from 101-109.
mat_0 <- matrix(101:109, 3, 3)
Using this matrix, extract
mat_0[2, 3][1] 108
mat_0[, 2, drop = FALSE] [,1]
[1,] 104
[2,] 105
[3,] 106
mat_0[1, , drop = FALSE] [,1] [,2] [,3]
[1,] 101 104 107
mat_0[, -1] [,1] [,2]
[1,] 104 107
[2,] 105 108
[3,] 106 109
mat_0[, c(1, 3)] [,1] [,2]
[1,] 101 107
[2,] 102 108
[3,] 103 109
mat_0[c(1, 3), ] [,1] [,2] [,3]
[1,] 101 104 107
[2,] 103 106 109
Let’s now add names to mat_0: “row_1”, … for the rows and “var_1”, … for the columns:
colnames(mat_0) <- colnames(mat_0, do.NULL = FALSE, prefix = "var_")
rownames(mat_0) <- rownames(mat_0, do.NULL = FALSE, prefix = "row_")
Using these names, extract
var_1 preserving the matrix structure of the result:
mat_0[, "var_1", drop = FALSE] var_1
row_1 101
row_2 102
row_3 103
row_1 and row_3:
mat_0[c("row_1", "row_3"), ] var_1 var_2 var_3
row_1 101 104 107
row_3 103 106 109
Extract all the values larger than 104:
mat_0[mat_0 > 104][1] 105 106 107 108 109
Which values are on the diagonal of mat_0?
diag(mat_0)[1] 101 105 109
Extract the upper triangular part of mat_0 excluding the diagonal.
mat_0[upper.tri(mat_0, diag = FALSE)][1] 104 107 108
You can change individual elements of a matrix by reassigning them a new value. Suppose you want to change the value on row 2 and column 3 of mat from 23 into 123, you can use
mat[2, 3] <- 123
mat var_1 var_2 var_3 var_4
row_1 11 12 13 41
row_2 21 22 123 42
row_3 31 32 33 43
row_4 41 42 34 44
If you want to change all values less than 25 into 0, you can subset using this condition and reassign the values of the elements where the condition is TRUE:
mat[mat < 25] <- 0
mat var_1 var_2 var_3 var_4
row_1 0 0 0 41
row_2 0 0 123 42
row_3 31 32 33 43
row_4 41 42 34 44
Suppose that you have a matrix, mat with 6 rows and 20 columns:
mat <- matrix(1:120, 6, 20)
You can change the dimensions of this matrix using dim(). For instance, if you want to change this matrix into a 3x40 matrix:
dim(mat) <- c(3, 40)
Note that you can do this as long as the length of the matrix is unaffected. In other words, the number of elements in both matrices must be the same.
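The sketch below shows what happens when the new dimensions do not match the number of elements. It uses tryCatch() to catch the error so that the code keeps running; the names m and res are illustrative.

```r
m <- matrix(1:120, 6, 20)

dim(m) <- c(3, 40)  # fine: 3 * 40 = 120 elements
dim(m)

# 7 * 20 = 140 does not match the 120 stored values, so R raises an error.
res <- tryCatch({
  dim(m) <- c(7, 20)
  "reshaped"
}, error = function(e) "dimension mismatch")

res
dim(m)              # unchanged: still 3 40
```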
Using the rbind() (row bind) and cbind() (column bind) functions you can combine vectors and matrices. The first, rbind(), combines by rows: it stacks one vector or matrix on top of the other. To do so, the vectors and matrices that will be combined need to have the same number of columns. cbind() places the vectors and matrices next to each other. The vectors and matrices in this function need to have the same number of rows.
Suppose that you have two matrices, mat_1 (filled with 1’s) and mat_2 (filled with 2’s):
mat_1 <- matrix(1, nrow = 3, ncol = 2)
mat_1 [,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 1 1
mat_2 <- matrix(2, nrow = 3, ncol = 1)
mat_2 [,1]
[1,] 2
[2,] 2
[3,] 2
They both have the same number of rows. That means that you can bind both and add the columns of mat_2 to those of mat_1:
mat_c12 <- cbind(mat_1, mat_2)
mat_c12 [,1] [,2] [,3]
[1,] 1 1 2
[2,] 1 1 2
[3,] 1 1 2
Note that cbind(mat_2, mat_1) would add the columns of mat_1 to those of mat_2:
mat_c21 <- cbind(mat_2, mat_1)
mat_c21 [,1] [,2] [,3]
[1,] 2 1 1
[2,] 2 1 1
[3,] 2 1 1
If you have a matrix mat_3 (filled with 3’s) with the same number of columns as e.g. mat_1, you can add its rows to those of mat_1 using rbind():
mat_3 = matrix(3, nrow = 2, ncol = 2)
mat_r13 <- rbind(mat_1, mat_3)
mat_r13 [,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 1 1
[4,] 3 3
[5,] 3 3
If you reverse the order, you would add the rows of mat_1 to those of mat_3:
mat_r31 <- rbind(mat_3, mat_1)
mat_r31 [,1] [,2]
[1,] 3 3
[2,] 3 3
[3,] 1 1
[4,] 1 1
[5,] 1 1
When we discussed subsetting a matrix, we introduced negative index positions to subset all but the rows/columns with a negative index. This is the first approach if you want to remove a row or a column. Suppose for instance that you want to remove the last two rows of mat_r31. You use their negative index positions and save the result as mat_r31. Note that you can specify the negative positions using a range, or you can collect them in a vector and add a minus sign: -c(). Here, we will use the last approach:
mat_r31 <- mat_r31[-c(4, 5), ]
mat_r31 [,1] [,2]
[1,] 3 3
[2,] 3 3
[3,] 1 1
You can remove the first two columns from mat_c12 in a similar way:
mat_c12 <- mat_c12[, -1:-2]
mat_c12[1] 2 2 2
The second approach uses a logical vector where a value TRUE will keep the row or column and a value FALSE will remove that column or row. To illustrate this approach, we’ll use mat. Suppose you want to remove columns 1 and 3. For a matrix with 4 columns, the logical vector would be c(FALSE, TRUE, FALSE, TRUE). Because mat has 40 columns, R recycles this shorter vector, so every odd-numbered column is removed. Using this vector to subset the matrix
mat_keep <- c(FALSE, TRUE, FALSE, TRUE)
mat[, mat_keep] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 4 10 16 22 28 34 40 46 52 58 64 70 76 82
[2,] 5 11 17 23 29 35 41 47 53 59 65 71 77 83
[3,] 6 12 18 24 30 36 42 48 54 60 66 72 78 84
[,15] [,16] [,17] [,18] [,19] [,20]
[1,] 88 94 100 106 112 118
[2,] 89 95 101 107 113 119
[3,] 90 96 102 108 114 120
If you reassign this result to mat you have effectively removed the columns for which the logical vector is FALSE (here, after recycling, all odd-numbered columns). You can use a similar approach to keep/remove rows.
Note you don’t need to write the logical vector by hand. Usually, this vector will be the outcome of a condition.
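As a sketch of such a condition, the example below keeps only the columns whose sum is larger than 10. The matrix m, the threshold and the names keep and m_kept are assumptions chosen for the illustration; colSums() is the base R function that returns the sum of every column.

```r
m <- matrix(c(1, 2, 3, 10, 20, 30, 4, 5, 6), nrow = 3, ncol = 3)

keep <- colSums(m) > 10            # logical vector, one value per column
keep

m_kept <- m[, keep, drop = FALSE]  # drop = FALSE keeps the matrix structure
m_kept
```

Because the logical vector has exactly one value per column, no recycling takes place and it is clear which columns are kept.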
Deconstructing a matrix refers to the operation where you drop the dimensions and change the matrix into a vector. To do so, you can use the c() function. For instance, applying this function to mat results in a vector.
c(mat) [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120
As you can see, the vector starts with the first column, then adds the second column, the third, and so on.
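If you need the values row by row instead, one option is to transpose the matrix first with base R's t() function. A minimal sketch on a small illustrative matrix m:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m

c(m)          # column by column: 1 2 3 4 5 6
as.vector(m)  # same result as c(m)
c(t(m))       # transposed first, so row by row: 1 3 5 2 4 6
```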
Using the next three matrices:
mat_1 <- matrix(1:12, 3, 4)
mat_2 <- matrix(11:22, 3, 4)
mat_3 <- matrix(31:42, 3, 4)
Add the columns of mat_3 to those of mat_1:
cbind(mat_1, mat_3) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 4 7 10 31 34 37 40
[2,] 2 5 8 11 32 35 38 41
[3,] 3 6 9 12 33 36 39 42
Add the rows of mat_2 to those of mat_1:
rbind(mat_1, mat_2) [,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
[4,] 11 14 17 20
[5,] 12 15 18 21
[6,] 13 16 19 22
Remove the second row of mat_2:
mat_2[-2, ] [,1] [,2] [,3] [,4]
[1,] 11 14 17 20
[2,] 13 16 19 22
Remove the second row of mat_2 using a logical vector:
mat_2[c(T, F, T), ] [,1] [,2] [,3] [,4]
[1,] 11 14 17 20
[2,] 13 16 19 22
Remove columns 1 and 3 from mat_1. Do so in three ways:
Option 1:
mat_1[, -c(1, 3)] [,1] [,2]
[1,] 4 10
[2,] 5 11
[3,] 6 12
Option 2:
mat_1[, c(-1, -3)] [,1] [,2]
[1,] 4 10
[2,] 5 11
[3,] 6 12
Option 3:
mat_1[, c(F, T, F, T)] [,1] [,2]
[1,] 4 10
[2,] 5 11
[3,] 6 12
Change the values of mat_1 on the upper triangular part, excluding the diagonal, to 0.
mat_1[upper.tri(mat_1, diag = FALSE)] <- 0
mat_1 [,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 2 5 0 0
[3,] 3 6 9 0
Turn mat_3 into a vector.
c(mat_3) [1] 31 32 33 34 35 36 37 38 39 40 41 42
Recall that many functions in R are vectorized. For matrices, that means that they apply to the individual elements of a matrix. This holds for most operators: they work on an element by element basis. However, note that for two matrices this requires that the dimensions of both matrices are the same.
Operators such as addition, subtraction, division, multiplication, integer division or modulus can be used with matrices. Using
mat_1 <- matrix(c(2, 4, 8, 10), 2, 2)
mat_2 <- matrix(c(1, 2, 3, 4), 2, 2)
we’ll illustrate these operators.
mat_1 + mat_2 [,1] [,2]
[1,] 3 11
[2,] 6 14
mat_1 - mat_2 [,1] [,2]
[1,] 1 5
[2,] 2 6
mat_1 * mat_2 [,1] [,2]
[1,] 2 24
[2,] 8 40
mat_1 / mat_2 [,1] [,2]
[1,] 2 2.666667
[2,] 2 2.500000
mat_1 %/% mat_2 [,1] [,2]
[1,] 2 2
[2,] 2 2
mat_1 %% mat_2 [,1] [,2]
[1,] 0 2
[2,] 0 2
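The same operators also work between a matrix and a single number or a shorter vector. In that case R recycles the shorter object, going down the columns. A small sketch with an illustrative matrix m that holds the same values as mat_1:

```r
m <- matrix(c(2, 4, 8, 10), 2, 2)

m * 2          # a single number is applied to every element

# A vector with one value per row is recycled down the columns:
# 10 is added to every element of row 1 and 20 to every element of row 2.
m + c(10, 20)
```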
Most functions are also vectorized. In other words, they work on an element by element basis if applied to a matrix. For instance
abs(-1 * mat_1) [,1] [,2]
[1,] 2 8
[2,] 4 10
log(mat_1) [,1] [,2]
[1,] 0.6931472 2.079442
[2,] 1.3862944 2.302585
mat_1^3 [,1] [,2]
[1,] 8 512
[2,] 64 1000
sqrt(mat_1) [,1] [,2]
[1,] 1.414214 2.828427
[2,] 2.000000 3.162278
exp(mat_1) [,1] [,2]
[1,] 7.389056 2980.958
[2,] 54.598150 22026.466
Here, the functions are applied to all elements of the matrix mat_1. Note that this is not necessary. If you subset the column or rows of mat_1, R will apply a function only to the extracted rows or columns. This also holds for the mathematical operators. For instance:
Add the first column of mat_1 to the second column of mat_2:
mat_1[, 1] + mat_2[, 2][1] 5 8
Take the log of the first row of mat_1:
log(mat_1[1, ])[1] 0.6931472 2.0794415
We covered a number of statistical functions and discussed how they are applied to vectors. By extension, you can use these for matrices too. Subsetting a column, row or element allows you to apply these functions to all elements in one or multiple rows, columns, elements or ranges. For instance,
The cumulative probability of the t-distribution with 5 degrees of freedom for the first column of mat_2:
pt(mat_2[, 1], df = 5)[1] 0.8183913 0.9490303
The density of the F-distribution with 6 and 2 degrees of freedom for rows 1 to 2 of mat_1:
df(mat_1[1:2, ], df1 = 6, df2 = 2) [,1] [,2]
[1,] 0.13494377 0.013271040
[2,] 0.04537656 0.008770781
In the previous section, we applied functions to all elements of the matrix or to a subset of columns and/or rows. R includes a number of functions that you can apply to every column or every row. In addition, you can use the apply() function to apply a function per row or per column.
R includes functions that work per row or column of a matrix. To illustrate some of these functions, we’ll use the following random matrices:
n = 1000
m = 5
matr <- matrix(rnorm(n * m), n, m)
matu <- matrix(runif(n * m, min = 0, max = 1), n, m)
colnames(matr) <- colnames(matr, do.NULL = FALSE, prefix = "var_")
colnames(matu) <- colnames(matu, do.NULL = FALSE, prefix = "var_")
rownames(matr) <- rownames(matr, do.NULL = FALSE, prefix = "obs_")
rownames(matu) <- rownames(matu, do.NULL = FALSE, prefix = "obs_")
The mean of every column of the first matrix, matr, should be (close to) zero and its standard deviation (close to) one. For the second matrix, where each element is drawn from a uniform distribution with minimum zero and maximum one, the sum of each column should be close to 500 (the number of observations, 1000, multiplied by the expected value 0.5), the minimum should be close to zero and the maximum should be (close to) one.
R includes functions to calculate the means per column or per row: colMeans(x, na.rm = FALSE, dims = 1) and rowMeans(x, na.rm = FALSE, dims = 1). In both functions, x refers to the matrix. If your matrix includes missing values, you need to change the second argument from FALSE to TRUE. The argument dims = 1 allows you to specify which dimensions are regarded as rows or columns. You can leave this at its default value. Using these functions, you can calculate the mean per column:
colMeans(matr, na.rm = TRUE)
 var_1 var_2 var_3 var_4 var_5
0.005143367 -0.049817391 -0.001946084 0.020798152 0.045079932
or the mean per row: rowMeans(matr, na.rm = TRUE). In this case, given the size of the matrix, the output would be very long.
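To make the role of na.rm concrete, here is a minimal sketch on a small matrix with one missing value. The matrix m_na is hypothetical and is not part of the examples in this chapter.

```r
# Hypothetical 3x2 matrix with one missing value in the first column
m_na <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3, ncol = 2)

# Without na.rm = TRUE, a single NA makes the whole column mean NA
colMeans(m_na)

# With na.rm = TRUE, the missing value is dropped before averaging
colMeans(m_na, na.rm = TRUE)   # 1.5 and 5

# The same argument works per row
rowMeans(m_na, na.rm = TRUE)   # 2.5, 3.5 and 6
```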
Note that you can reduce the number of columns by subsetting the matrix. For instance, to determine the column means of columns var_2, var_3 and var_4, you can use the grepl() function to subset these columns:
colMeans(matr[, grepl(pattern = "_[2-4]", x = colnames(matr))], na.rm = TRUE)
 var_2 var_3 var_4
-0.049817391 -0.001946084 0.020798152
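Pattern matching is only one way to select these columns. As a sketch, the block below compares selection by pattern, by name and by position on a small hypothetical matrix m_demo (the seed and the matrix are assumptions made for illustration).

```r
set.seed(1)  # hypothetical seed, only to make the sketch reproducible
m_demo <- matrix(rnorm(20), nrow = 4, ncol = 5)
colnames(m_demo) <- paste0("var_", 1:5)

# Three ways to select columns var_2, var_3 and var_4
by_pattern <- colMeans(m_demo[, grepl("_[2-4]", colnames(m_demo))])
by_name    <- colMeans(m_demo[, c("var_2", "var_3", "var_4")])
by_index   <- colMeans(m_demo[, 2:4])

# All three return the same named vector
identical(by_pattern, by_name)
identical(by_name, by_index)
```

Selecting by name is usually the most readable choice when only a few columns are involved. A pattern is convenient when many column names share a common structure.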
To calculate the sum of all values in a column or row, R includes colSums(x, na.rm = FALSE) and rowSums(x, na.rm = FALSE). As with colMeans(), the first argument is the matrix while the second allows you to specify whether missing values should be disregarded in the calculation. Using colSums() to calculate the sum of all values per column in matu:
colSums(matu, na.rm = TRUE)
 var_1 var_2 var_3 var_4 var_5
498.8963 494.7191 510.6848 491.3499 488.0277
You can use scale() to standardize the values per column. Recall that a standardized value is calculated as
\[ x_{stand} = {{(x - \overline{x})} \over{s}} \]
where \(\overline{x}\) is the mean and \(s\) is the standard deviation.
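As a small worked example of this formula, the sketch below standardizes a hypothetical vector x by hand and compares the result with scale() applied to the same values stored as a one-column matrix.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standardize by hand: subtract the mean, divide by the standard deviation
x_stand <- (x - mean(x)) / sd(x)

# scale() does the same per column of a matrix
m_x <- matrix(x, ncol = 1)
all.equal(as.vector(scale(m_x)), x_stand)

# The standardized values have mean (close to) 0 and standard deviation 1
mean(x_stand)
sd(x_stand)
```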
To illustrate this function, we’ll redefine matr and determine its values as draws from a normal distribution with mean 5 and standard deviation 10:
matr <- matrix(rnorm(n * m, 5, 10), n, m)
If you check the column means, you’ll see that they are (close to) 5:
colMeans(matr)
[1] 5.300783 5.116847 5.083844 5.348894 4.590922
To standardize these values, you can use scale(x, center = TRUE, scale = TRUE). The first argument is the matrix you want to scale. The second and third arguments determine whether you want to subtract the mean (i.e. center the columns in matr) and divide by the standard deviation of every column in matr. By default, both are the case. Applying that function to matr:
matrs <- scale(matr, center = TRUE, scale = TRUE)
If you now look at the means per column,
colMeans(matrs)
[1] 1.811398e-17 6.473988e-18 1.907155e-17 2.709898e-17 4.207051e-17
you can verify that they are (close to) zero. The standard deviation is also (close to) one:
sd(matrs[, 1])
[1] 1
sd(matrs[, 2])
[1] 1
sd(matrs[, 3])
[1] 1
sd(matrs[, 4])
[1] 1
sd(matrs[, 5])
[1] 1
If you set one of the arguments center or scale to FALSE, R will not center (i.e. will not subtract the means) or will not scale (i.e. will not divide by the standard deviation). In addition, you can supply your own vector that R will use to center or to scale. Suppose that you don’t want to center, but want to scale by the sum of all values in a column. You can use:
matrsum <- scale(matr, center = FALSE, scale = colSums(matr))
Using colSums(), you can verify this result:
colSums(matrsum)
[1] 1 1 1 1 1
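Scaling by a user-supplied vector can also be written with base R's sweep(), which divides every column by the matching element of a vector. The sketch below uses a small hypothetical matrix m_pos to show that both routes give the same column shares.

```r
# Hypothetical 3x2 matrix with positive values
m_pos <- matrix(1:6, nrow = 3, ncol = 2)

# Route 1: scale() with a custom scaling vector and no centering
shares <- scale(m_pos, center = FALSE, scale = colSums(m_pos))

# Route 2: sweep() divides each column (MARGIN = 2) by its column sum
by_sweep <- sweep(m_pos, MARGIN = 2, STATS = colSums(m_pos), FUN = "/")

# Each column now sums to 1 and both routes agree
colSums(shares)
all.equal(as.vector(shares), as.vector(by_sweep))
```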
The apply() function allows you to apply any function to each separate row or column of a matrix. Recall that we used this function in Chapter 2. The function includes a number of arguments: apply(X, MARGIN, FUN, ..., simplify = TRUE). The first, X, refers to the matrix. The second, MARGIN, determines whether the function is applied to all rows (MARGIN = 1) or to all columns (MARGIN = 2). To apply the function to both, i.e. to every individual element, you can use MARGIN = c(1, 2). FUN refers to the function you want to apply to every row or column. The three dots ... refer to optional arguments for FUN. For instance, you can add na.rm = TRUE as an optional argument if FUN = sd. The last argument tells R to simplify the output if possible. For matrices, apply() simplifies to a vector or array. In case simplify = FALSE, R will return a list. A list is a data structure that we will discuss later in this chapter.
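The effect of MARGIN and simplify is easiest to see on a tiny matrix. The sketch below uses a hypothetical 2x3 matrix m_small. It assumes R 4.1.0 or later, where apply() has the simplify argument.

```r
# Hypothetical 2x3 matrix: row 1 is (1, 3, 5), row 2 is (2, 4, 6)
m_small <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1: one result per row
row_sums <- apply(m_small, MARGIN = 1, FUN = sum)   # 9 12

# MARGIN = 2: one result per column
col_sums <- apply(m_small, MARGIN = 2, FUN = sum)   # 3 7 11

# MARGIN = c(1, 2): the function is applied to every single element
squares <- apply(m_small, MARGIN = c(1, 2), FUN = function(v) v^2)

# simplify = FALSE returns a list with one element per column
res_list <- apply(m_small, MARGIN = 2, FUN = range, simplify = FALSE)
is.list(res_list)
```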
This function allows you to avoid for loops. Although it is not always possible to avoid for loops, code written with apply() is usually shorter and easier to read. Note that apply() itself loops over the rows or columns, so the gain in speed over a well-written for loop is often modest. For large datasets, dedicated vectorized functions such as colMeans() or colSums() are typically much faster than both.
Let’s use a couple of examples to illustrate how you can use apply(). Here, we will apply a function to the columns. For rows, the output would be too long. Note that here too, you can subset the matrix you include in the first argument. Let’s start with the two functions we already met: the mean and sum of the columns. Using apply(), you can calculate the mean of every column in matr:
apply(matr, MARGIN = 2, FUN = mean, na.rm = TRUE, simplify = TRUE)
[1] 5.300783 5.116847 5.083844 5.348894 4.590922
For the sum of matu:
apply(matu, MARGIN = 2, FUN = sum, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
498.8963 494.7191 510.6848 491.3499 488.0277
Using a for loop would require you to write
mat_mean <- matrix(0, 1, 5)
for (i in 1:5) {
  mat_mean[i] <- mean(matr[, i])
}
mat_mean
     [,1] [,2] [,3] [,4] [,5]
[1,] 5.300783 5.116847 5.083844 5.348894 4.590922
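The three routes to the column means (a for loop, apply() and colMeans()) can be checked against each other. The sketch below does so on a small hypothetical matrix m_demo with an assumed seed. It only checks that the results agree and makes no claim about timing.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible
m_demo <- matrix(rnorm(50, mean = 5, sd = 10), nrow = 10, ncol = 5)

# Route 1: for loop with a preallocated result vector
means_loop <- numeric(ncol(m_demo))
for (j in seq_len(ncol(m_demo))) {
  means_loop[j] <- mean(m_demo[, j])
}

# Route 2: apply() over the columns
means_apply <- apply(m_demo, 2, mean)

# Route 3: the dedicated function
means_col <- colMeans(m_demo)

# All three agree up to floating point precision
all.equal(means_loop, means_apply)
all.equal(means_loop, means_col)
```

Using seq_len(ncol(m_demo)) instead of a hard-coded 1:5 keeps the loop correct if the number of columns changes.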
The argument for FUN can include most functions that we have seen so far. A couple of examples to illustrate:
the quantiles of every column of matr:
apply(matr, MARGIN = 2, FUN = quantile, simplify = TRUE)
     [,1] [,2] [,3] [,4] [,5]
0% -21.249655 -23.997095 -29.775836 -26.053802 -23.360434
25% -1.191518 -1.251438 -1.893282 -1.246803 -2.013940
50% 4.960671 5.078630 5.033694 5.362717 4.899435
75% 11.814882 11.494527 11.955620 11.887836 10.981581
100% 37.325787 30.820108 38.755552 38.062730 35.365880
the standard deviation of every column of matr:
apply(matr, MARGIN = 2, FUN = sd, simplify = TRUE)
[1] 9.513057 9.520386 10.414700 10.239027 9.948180
the median of every column of matr:
apply(matr, MARGIN = 2, FUN = median, simplify = TRUE)
[1] 4.960671 5.078630 5.033694 5.362717 4.899435
the minimum of every column of matu (recall, should be (close to) 0) and the row in which it occurs:
apply(matu, MARGIN = 2, FUN = min, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
3.130340e-04 3.611716e-05 3.060959e-03 3.400710e-04 1.325362e-03
apply(matu, MARGIN = 2, FUN = which.min, simplify = TRUE)
var_1 var_2 var_3 var_4 var_5
926 54 956 283 579
the maximum of every column of matu (recall, should be (close to) 1) and the row in which it occurs:
apply(matu, MARGIN = 2, FUN = max, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
0.9981666 0.9997119 0.9987890 0.9996086 0.9992079
apply(matu, MARGIN = 2, FUN = which.max, simplify = TRUE)
var_1 var_2 var_3 var_4 var_5
22 46 291 743 947
the cumulative sum of the first 10 rows of every column of matu:
apply(matu[1:10, ], MARGIN = 2, FUN = cumsum, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
obs_1 0.600226 0.0169425 0.1018651 0.6426288 0.158503
obs_2 1.265842 0.1729380 0.3665798 1.5988249 0.946418
obs_3 2.119343 1.1211914 0.4628187 1.9303313 1.093771
obs_4 2.847117 1.1250811 1.1040666 2.2613488 2.054247
obs_5 3.028297 2.1034462 1.1882292 2.4386040 2.696421
obs_6 3.540854 2.2688245 1.7402507 3.3594061 2.948909
obs_7 3.782852 2.7938194 1.7781946 4.3171831 3.733870
obs_8 4.271986 3.2511397 2.4019932 4.7598717 3.736695
obs_9 4.524177 3.9606292 2.7464903 4.8748816 3.979814
obs_10 4.985903 4.1790758 3.7290677 5.4866638 4.220611
the cumulative product of the first 10 rows of every column of matu:
apply(matu[1:10, ], MARGIN = 2, FUN = cumprod, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
obs_1 0.6002259518 1.694250e-02 1.018651e-01 0.6426288278 1.585030e-01
obs_2 0.3995199498 2.642955e-03 2.696519e-02 0.6144791858 1.248869e-01
obs_3 0.3409906601 2.506191e-03 2.595101e-03 0.2037037471 1.840246e-02
obs_4 0.2481642950 9.748173e-06 1.664103e-03 0.0674295104 1.767512e-02
obs_5 0.0449622375 9.537273e-06 1.400551e-04 0.0119522310 1.135051e-02
obs_6 0.0230457362 1.577258e-06 7.731343e-05 0.0110056391 2.865859e-03
obs_7 0.0055770128 8.280525e-07 2.933578e-06 0.0105409480 2.249587e-03
obs_8 0.0027279104 3.786852e-07 1.829962e-06 0.0046663574 6.355042e-06
obs_9 0.0006879535 2.686732e-07 6.304165e-07 0.0005366777 1.545037e-06
obs_10 0.0003176458 5.869076e-08 6.194330e-07 0.0003283298 3.720400e-07
a random sample of 100 values from every column of matu:
apply(matu, MARGIN = 2, FUN = sample, size = 100, replace = FALSE, simplify = TRUE)
 var_1 var_2 var_3 var_4 var_5
[1,] 0.427874653 0.2962461817 0.245244351 0.71371006 0.614320555
[2,] 0.619837658 0.9732767020 0.005746100 0.69770648 0.104410692
[3,] 0.211551783 0.7025180887 0.634824334 0.41890317 0.751804105
[4,] 0.937301772 0.8625655503 0.313025048 0.25229257 0.727244976
[5,] 0.468054898 0.1333571775 0.412062724 0.14209116 0.205491589
[6,] 0.652220312 0.1343263080 0.982577400 0.88488528 0.545954252
[7,] 0.094731264 0.4035322617 0.792527967 0.20194269 0.846915320
[8,] 0.963287048 0.6726470231 0.396185544 0.03091596 0.379442458
[9,] 0.829897380 0.1997247315 0.415923977 0.20976465 0.203501021
[10,] 0.395167737 0.0038896375 0.377768022 0.08304928 0.522725631
[11,] 0.663212207 0.8908440617 0.932076235 0.68903437 0.069070730
[12,] 0.010062481 0.5519522047 0.340680641 0.45924904 0.499460750
[13,] 0.163756774 0.4289824814 0.263133530 0.51283482 0.205762443
[14,] 0.312948626 0.6088482901 0.699145186 0.18986819 0.818035963
[15,] 0.569327792 0.4906537151 0.495169261 0.37241806 0.360954494
[16,] 0.688230235 0.3483616519 0.923396978 0.67985158 0.527013391
[17,] 0.304477348 0.7016855879 0.738518027 0.22485392 0.699525842
[18,] 0.914051189 0.0028812722 0.949652195 0.21667207 0.184232822
[19,] 0.015076805 0.7671682036 0.071696028 0.84160925 0.434728433
[20,] 0.680477128 0.7291652344 0.789702306 0.04536275 0.132914925
[21,] 0.483864088 0.4406194480 0.028702055 0.01458980 0.045473437
[22,] 0.090609562 0.3710276980 0.862361548 0.40796349 0.126826719
[23,] 0.535610609 0.8934164054 0.939425986 0.86611405 0.031051245
[24,] 0.406181956 0.3546411851 0.343321695 0.74156995 0.401845718
[25,] 0.650372807 0.2254920613 0.007305878 0.96089933 0.294542913
[26,] 0.523949534 0.0001744141 0.447297217 0.40851562 0.764654150
[27,] 0.744043353 0.7912811625 0.052715830 0.96053204 0.409537329
[28,] 0.976969536 0.1195863134 0.982976033 0.74644039 0.110714359
[29,] 0.133260578 0.2870601334 0.115480395 0.13382330 0.122441713
[30,] 0.214659605 0.9601216391 0.788732514 0.11746025 0.800846495
[31,] 0.606783367 0.6674754317 0.860529469 0.04416643 0.575592778
[32,] 0.701639672 0.8425107629 0.600073412 0.05594724 0.788355635
[33,] 0.395247573 0.4100571568 0.167690868 0.70724876 0.945276574
[34,] 0.997906438 0.2519476619 0.735430416 0.48066398 0.747088682
[35,] 0.196408861 0.4243587430 0.596180666 0.51482828 0.957349570
[36,] 0.727074796 0.9684854858 0.652410179 0.96595233 0.722289495
[37,] 0.986973475 0.0895374368 0.845779282 0.41230179 0.293967669
[38,] 0.773164391 0.9638337884 0.552748314 0.06159046 0.458007812
[39,] 0.796141559 0.7765745323 0.846832885 0.20770468 0.298801046
[40,] 0.624487828 0.3564885638 0.660334249 0.17683940 0.057621229
[41,] 0.817350582 0.5913988259 0.220299744 0.86317435 0.299523436
[42,] 0.967824784 0.4424968567 0.921304168 0.26098331 0.489219855
[43,] 0.914439068 0.3942986336 0.361325487 0.60169566 0.616653974
[44,] 0.768421780 0.4217687396 0.295858546 0.37798105 0.077182890
[45,] 0.016625161 0.6658115482 0.165320277 0.77455860 0.158503026
[46,] 0.440597997 0.9446910012 0.243012169 0.06447767 0.420961288
[47,] 0.719860912 0.1772992392 0.560751369 0.44902747 0.849164016
[48,] 0.839452247 0.8715715578 0.214128093 0.23395167 0.642936618
[49,] 0.833640202 0.8292887262 0.674816251 0.66711147 0.277001954
[50,] 0.338264248 0.3612342407 0.238634258 0.94576723 0.542285567
[51,] 0.542008152 0.0192937264 0.760847150 0.01364760 0.358393266
[52,] 0.392433255 0.1112235764 0.775153869 0.87165202 0.237003826
[53,] 0.164499223 0.4162324592 0.275682220 0.19663832 0.074004117
[54,] 0.369945183 0.4217562771 0.262902491 0.78702023 0.042206783
[55,] 0.240512830 0.2073696717 0.976909903 0.05358947 0.167880482
[56,] 0.414119643 0.8492104961 0.605697764 0.41609591 0.872418524
[57,] 0.662489830 0.1843502908 0.716483586 0.26179877 0.819733626
[58,] 0.758360146 0.3945711800 0.928629791 0.01249312 0.650508635
[59,] 0.599016845 0.3895975223 0.367571383 0.91714007 0.002824981
[60,] 0.336946693 0.1554408779 0.450853944 0.76029389 0.491443093
[61,] 0.716182998 0.9103479120 0.725621383 0.01429185 0.686092847
[62,] 0.061481926 0.0163614631 0.989081329 0.52692042 0.761924360
[63,] 0.286311698 0.6886869357 0.764402661 0.19420963 0.645052908
[64,] 0.743579587 0.6328592277 0.015155191 0.06716482 0.375276764
[65,] 0.338893080 0.8474844724 0.048234599 0.92080208 0.868130674
[66,] 0.955673186 0.1031871077 0.428569158 0.28122884 0.376567396
[67,] 0.624607087 0.4536155986 0.790632120 0.38309418 0.860546991
[68,] 0.575755496 0.4721420540 0.053433392 0.47605170 0.634406123
[69,] 0.755386295 0.0770887041 0.696568140 0.17478258 0.801871024
[70,] 0.254544970 0.6029461438 0.617348765 0.71324939 0.334514807
[71,] 0.492098489 0.3417994273 0.806618107 0.64777243 0.158112442
[72,] 0.981012912 0.7096717325 0.890726835 0.97584609 0.060581106
[73,] 0.277054097 0.8711021238 0.514938284 0.22167477 0.422439821
[74,] 0.211788388 0.8441557020 0.972471799 0.47317091 0.444995942
[75,] 0.573928014 0.3699196551 0.576670796 0.15495689 0.749400497
[76,] 0.895156737 0.3571989350 0.218822891 0.06512148 0.393396356
[77,] 0.881618620 0.0430566731 0.635972946 0.93043863 0.766673522
[78,] 0.682188959 0.7933296857 0.625962837 0.03882977 0.034771726
[79,] 0.814984214 0.0663515609 0.524825637 0.84765934 0.265665035
[80,] 0.877733724 0.4871458802 0.025809797 0.82977315 0.754634449
[81,] 0.637299148 0.5170553513 0.833798896 0.46606657 0.207820970
[82,] 0.801481901 0.2931252166 0.136746070 0.24331348 0.826765755
[83,] 0.871730286 0.9466384773 0.064129959 0.64118513 0.762469654
[84,] 0.159506244 0.6049182811 0.029450050 0.56676077 0.697369893
[85,] 0.731072485 0.0996799695 0.164878438 0.58723558 0.221791439
[86,] 0.094668780 0.7252484844 0.710831779 0.93143246 0.950283931
[87,] 0.469556776 0.4755556541 0.319298754 0.60529539 0.649953627
[88,] 0.508050772 0.0509314602 0.015070460 0.17068511 0.869120294
[89,] 0.086548502 0.5792983784 0.353958581 0.61178214 0.111286720
[90,] 0.997671658 0.5400455620 0.197374821 0.05923918 0.284796121
[91,] 0.107591556 0.3836067128 0.052368591 0.14104294 0.417281280
[92,] 0.358777107 0.7570024268 0.086379471 0.13378116 0.746380477
[93,] 0.490966906 0.0319006587 0.891686819 0.50759548 0.242711229
[94,] 0.459255106 0.5008176900 0.482638942 0.87273702 0.338060063
[95,] 0.794867382 0.7031364362 0.456189326 0.09008759 0.870708651
[96,] 0.007536766 0.5311022992 0.284598397 0.92246731 0.065699746
[97,] 0.983863101 0.5435822248 0.992995270 0.76408524 0.590230260
[98,] 0.483696888 0.5903762605 0.520435375 0.91285026 0.627955907
[99,] 0.202509591 0.3787623248 0.370204954 0.84725329 0.556368871
[100,] 0.839834335 0.4521930083 0.804261910 0.18736976 0.368933453
Note that in this case, the ... in apply() is used to specify the sample size and replacement method: size = 100, replace = FALSE.
every column of matr sorted in decreasing order (here the first 10 rows):
apply(matr[1:10, ], 2, sort, decreasing = TRUE, simplify = TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,] 15.460149 16.2546301 15.91465424 21.2399511 20.27354208
[2,] 15.020452 13.5497243 12.07519698 18.7872667 6.64172071
[3,] 8.922464 10.4209857 6.55576309 17.0451525 6.26640391
[4,] 6.488385 9.3100825 5.60532134 10.5860546 4.51004705
[5,] 4.711979 4.5994621 0.09526772 2.4596832 2.70702889
[6,] 4.606092 2.4900037 -0.91912989 1.2480615 1.37693694
[7,] -1.793389 1.0875137 -0.99093526 0.8535719 -0.01652879
[8,] -3.080375 -0.3370674 -1.47375072 -2.9420104 -0.63829143
[9,] -6.356346 -10.4651753 -3.40353956 -3.0203727 -2.86726235
[10,] -6.714175 -16.3735425 -8.44512193 -22.1994413 -4.65856598
Note that in this case, the ... in apply() is used to specify the order: decreasing = TRUE.
You are not limited to these predefined functions. In the FUN argument, you can supply your own function. As we will see in Chapter 14, R allows you to build your own functions. In addition, you can use so-called anonymous functions or lambda functions in the FUN argument of apply(). To do so, you first write function(x), or use the shorthand \(x), and add the body of your function, e.g. mean(x)/sd(x). With apply(), x refers to a column if MARGIN = 2 and to a row if MARGIN = 1. Note that there is no comma between function(x) and the body of your function. Putting all of this into apply() gives apply(mat, MARGIN, function(x) mean(x)/sd(x)), where MARGIN is either 1 or 2. This statement can be read as: for each row (if MARGIN = 1) or each column (if MARGIN = 2), substitute that row or column for x in the function. In other words, assuming that MARGIN = 2, R applies the function to mat[, 1], then to mat[, 2], … until it reaches the last column. Each time, R stores the outcome and, where possible, adds the name of that column to the resulting vector, matrix or list.
In the previous examples, FUN = mean was effectively shorthand for function(x) mean(x). As mean is a known function in R, you don’t need to wrap it in function(x). As this function is the third argument after mat and MARGIN, you can further shorten the apply() code to apply(mat, 2, mean).
For instance, to standardize all columns in a matrix, you can define function(x) (x - mean(x))/sd(x) or \(x) (x - mean(x))/sd(x). These functions are anonymous because they don’t have a name. Other functions, such as mean() or functions that you will write yourself, have a name. This allows you to use these functions throughout your code. Anonymous functions or lambda functions only exist where they are used and cannot be called in subsequent parts of your code.
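The difference between a named and an anonymous function can be sketched as follows. The name standardize and the matrix m_demo are hypothetical and chosen only for this illustration. The \(x) shorthand assumes R 4.1.0 or later.

```r
# A named function can be reused anywhere in your code
standardize <- function(x) (x - mean(x)) / sd(x)

# Hypothetical 4x2 matrix
m_demo <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4, ncol = 2)

res_named <- apply(m_demo, 2, standardize)

# The same computation with an anonymous function, in both notations
res_anon  <- apply(m_demo, 2, function(x) (x - mean(x)) / sd(x))
res_short <- apply(m_demo, 2, \(x) (x - mean(x)) / sd(x))

# All three give the same standardized matrix
all.equal(res_named, res_anon)
all.equal(res_named, res_short)
```

A named function is the better choice when the same transformation is needed in several places. An anonymous function keeps one-off transformations close to where they are used.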
To illustrate, let’s standardize the first 10 rows of all columns in matr after multiplying them by 10 and adding 5. Although we could include this restriction in the apply() function, we will first create a matrix with the first 10 rows:
matr10 <- 5 + 10 * matr[1:10, ]
matr10
     [,1] [,2] [,3] [,4] [,5]
[1,] -12.93389 1.629326 125.751970 -216.99441 -23.672623
[2,] -58.56346 15.875137 164.146542 13.53572 4.834712
[3,] 94.22464 -99.651753 5.952677 175.45152 -41.585660
[4,] -25.80375 109.209857 -79.451219 217.39951 207.735421
[5,] 159.60149 140.497243 70.557631 -25.20373 -1.382914
[6,] -62.14175 50.994621 -9.737507 29.59683 32.070289
[7,] 52.11979 167.546301 -29.035396 17.48061 50.100470
[8,] 51.06092 -158.735425 61.053213 110.86055 67.664039
[9,] 155.20452 98.100825 -4.191299 192.87267 71.417207
[10,] 69.88385 29.900037 -4.909353 -24.42010 18.769369
Using matr10, we can now write the apply() command to standardize:
apply(matr10, 2, function(x) (x - mean(x))/sd(x))
     [,1] [,2] [,3] [,4] [,5]
[1,] -0.6822876 -0.32897513 1.2872629 -2.0395166 -0.88910479
[2,] -1.2462908 -0.19075950 1.8035029 -0.2723077 -0.48205544
[3,] 0.6422431 -1.31162383 -0.3235164 0.9689141 -1.14488071
[4,] -0.8413651 0.71479209 -1.4718273 1.2904810 2.41511475
[5,] 1.4503320 1.01834838 0.5451391 -0.5692784 -0.57083542
[6,] -1.2905202 0.14997656 -0.5344812 -0.1491857 -0.09316522
[7,] 0.1218070 1.28078360 -0.7939538 -0.2420668 0.16428339
[8,] 0.1087188 -1.88486509 0.4173461 0.4737695 0.41506934
[9,] 1.3959834 0.60701010 -0.4599088 1.1024619 0.46865992
[10,] 0.3413793 -0.05468719 -0.4695635 -0.5632713 -0.28308583
or
apply(matr10, 2, \(x) (x - mean(x))/sd(x))
     [,1] [,2] [,3] [,4] [,5]
[1,] -0.6822876 -0.32897513 1.2872629 -2.0395166 -0.88910479
[2,] -1.2462908 -0.19075950 1.8035029 -0.2723077 -0.48205544
[3,] 0.6422431 -1.31162383 -0.3235164 0.9689141 -1.14488071
[4,] -0.8413651 0.71479209 -1.4718273 1.2904810 2.41511475
[5,] 1.4503320 1.01834838 0.5451391 -0.5692784 -0.57083542
[6,] -1.2905202 0.14997656 -0.5344812 -0.1491857 -0.09316522
[7,] 0.1218070 1.28078360 -0.7939538 -0.2420668 0.16428339
[8,] 0.1087188 -1.88486509 0.4173461 0.4737695 0.41506934
[9,] 1.3959834 0.60701010 -0.4599088 1.1024619 0.46865992
[10,] 0.3413793 -0.05468719 -0.4695635 -0.5632713 -0.28308583
Here, we used “\(x)” as shorthand for “function(x)”.
Let’s now use the apply() function to
rescale every column of matr10 using the min-max transformation (x - min(x))/(max(x) - min(x)). This function rescales the values in every column to a 0-1 range:
apply(matr10, 2, \(x) (x - min(x))/(max(x) - min(x)))
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.22191368 0.4914917 0.8423854 0.0000000 0.07184726
[2,] 0.01613708 0.5351527 1.0000000 0.5306937 0.18618711
[3,] 0.70516872 0.1810818 0.3505939 0.9034333 0.00000000
[4,] 0.16387424 0.8212084 0.0000000 1.0000000 1.00000000
[5,] 1.00000000 0.9170991 0.6158055 0.4415133 0.16124888
[6,] 0.00000000 0.6427882 0.2861837 0.5676673 0.29542608
[7,] 0.51528761 1.0000000 0.2069634 0.5397751 0.36774319
[8,] 0.51051238 0.0000000 0.5767887 0.7547411 0.43818878
[9,] 0.98017090 0.7871610 0.3089516 0.9435378 0.45324233
[10,] 0.59539855 0.5781368 0.3060039 0.4433172 0.24207752
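count the number of positive values in every column of matr10:
apply(matr10, 2, function(x) sum(x > 0))
[1] 6 8 5 7 7
shift every column of matr10 so that its minimum becomes zero:
apply(matr10, 2, \(x) x + abs(min(x) - 0))
     [,1] [,2] [,3] [,4] [,5]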
[1,] 49.207857 160.36475 205.20319 0.0000 17.91304
[2,] 3.578288 174.61056 243.59776 230.5301 46.42037
[3,] 156.366395 59.08367 85.40390 392.4459 0.00000
[4,] 36.338003 267.94528 0.00000 434.3939 249.32108
[5,] 221.743237 299.23267 150.00885 191.7907 40.20275
[6,] 0.000000 209.73005 69.71371 246.5912 73.65595
[7,] 114.261542 326.28173 50.41582 234.4750 91.68613
[8,] 113.202667 0.00000 140.50443 327.8550 109.24970
[9,] 217.346269 256.83625 75.25992 409.8671 113.00287
[10,] 132.025602 188.63546 74.54187 192.5743 60.35503
test whether the mean of every column of matr10 is larger than 5 using a t-distribution with 9 degrees of freedom:
apply(matr10, 2, function(x) pt(mean(x)-5, df = 9, lower.tail = FALSE))
[1] 1.789074e-11 1.060196e-10 6.264105e-10 3.993985e-12 4.521195e-11
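The last statement passes the raw difference mean(x) - 5 to pt(). The usual one-sample t statistic also divides this difference by the standard error sd(x)/sqrt(n). The sketch below shows that version on a hypothetical matrix m_demo (the seed and data are assumptions) and compares it with base R's t.test().

```r
set.seed(123)  # hypothetical seed, only to make the sketch reproducible
m_demo <- matrix(rnorm(50, mean = 5, sd = 10), nrow = 10, ncol = 5)

# One-sided p-value per column for H0: mean = 5 against H1: mean > 5
p_manual <- apply(m_demo, 2, function(x) {
  t_stat <- (mean(x) - 5) / (sd(x) / sqrt(length(x)))  # standardized difference
  pt(t_stat, df = length(x) - 1, lower.tail = FALSE)
})

# The same p-values from t.test()
p_ttest <- apply(m_demo, 2, function(x) {
  t.test(x, mu = 5, alternative = "greater")$p.value
})

all.equal(p_manual, p_ttest)
```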
Using *, R calculates the product of two matrices element-wise. You can also calculate the matrix product of two matrices. In addition, R allows you to calculate e.g. the determinant of a (square) matrix. To illustrate, we’ll use one square matrix A
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
A
     [,1] [,2] [,3]
[1,] 147 123 159
[2,] 258 456 483
[3,] 369 789 267
and a column vector x with 3 rows:
x <- matrix(c(5, 10, 15), 3, 1)
x
     [,1]
[1,] 5
[2,] 10
[3,] 15
Using these two matrices, you can now do matrix algebra. For instance:
transpose a matrix with t(). For the square matrix A, the element in position (i, j) moves to position (j, i):
t(A)
     [,1] [,2] [,3]
[1,] 147 258 369
[2,] 123 456 789
[3,] 159 483 267
As you can see, 258, which is in position (2, 1) in A is now located in position (1, 2). Applying the transpose to x changes this vector from a column vector into a row vector:
t(x)
     [,1] [,2] [,3]
[1,] 5 10 15
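calculate the matrix product with %*%. The number of columns of the first matrix must equal the number of rows of the second, and the result has dimensions (nrow(first), ncol(second)). Here, A is a 3x3 matrix and x is a 3x1 matrix. Using %*% you can multiply both. The outcome is a 3x1 matrix:
A %*% x
     [,1]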
[1,] 4350
[2,] 13095
[3,] 13740
The element in position (1, 1) is equal to 147 * 5 + 123 * 10 + 159 * 15, the element in position (2, 1) is equal to 258 * 5 + 456 * 10 + 483 * 15 and the element in (3, 1) is equal to 369 * 5 + 789 * 10 + 267 * 15.
calculate the determinant. A is a square matrix so we can use det(A):
det(A)
[1] -19060920
Here, det(A) = 147 * 456 * 267 + 123 * 483 * 369 + 258 * 789 * 159 - 159 * 456 * 369 - 123 * 258 * 267 - 147 * 789 * 483. Here, the determinant is different from zero. In other words, the columns in A are linearly independent.
calculate the trace, i.e. the sum of the diagonal elements:
sum(diag(A))
[1] 870
calculate the inverse with the solve(a, b) function. In general, this function solves a system of equations ax = b. If b is not supplied, it is set equal to the identity matrix. In that case, solve() calculates the inverse:
solve(A)
     [,1] [,2] [,3]
[1,] 0.013605587 -0.004858632 0.0006870078
[2,] -0.005736397 0.001018943 0.0015727992
[3,] -0.001851852 0.003703704 -0.0018518519
calculate the eigenvalues and eigenvectors with eigen():
eigen(A)$values
[1] 1079.22223 -273.74185 64.51962
eigen(A)$vectors
     [,1] [,2] [,3]
[1,] -0.2102804 -0.1744814 -0.9081183
[2,] -0.6517779 -0.4998281 0.3790710
[3,] -0.7286753 0.8483679 0.1778379
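The results in this list can be cross-checked against each other. The sketch below redefines A and x as above and verifies a few standard identities: the first element of the product computed by hand, the inverse, and the relation between the eigenvalues, the trace and the determinant.

```r
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
x <- matrix(c(5, 10, 15), 3, 1)

# First element of A %*% x: row 1 of A times x, summed
sum(A[1, ] * x[, 1]) == (A %*% x)[1, 1]

# A matrix times its inverse gives the identity matrix
all.equal(A %*% solve(A), diag(3))

# The eigenvalues sum to the trace and multiply to the determinant
ev <- eigen(A)$values
all.equal(sum(ev), sum(diag(A)))
all.equal(prod(ev), det(A))

# An eigenvector v satisfies A v = lambda v
v1 <- eigen(A)$vectors[, 1]
all.equal(as.vector(A %*% v1), ev[1] * v1)
```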
From your mathematics class, you may recall that the solution of a system of equations
\[ Ax = B \]
equals
\[ x = A^{-1}B. \]
Let’s first define B:
B <- A %*% x
In R, you can find the solution as solve(A, B):
solve(A, B)
     [,1]
[1,] 5
[2,] 10
[3,] 15
Recall that we used this function to calculate the inverse. There, the function sets B equal to the identity matrix, in other words
\[ x = A^{-1}B = A^{-1}I = A^{-1} \]
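As a sketch, the two routes to the solution can be compared directly: solving the system with solve(A, B) and multiplying the inverse of A by B. The block redefines A, x and B so that it runs on its own.

```r
A <- matrix(c(147, 258, 369, 123, 456, 789, 159, 483, 267), 3, 3)
x <- matrix(c(5, 10, 15), 3, 1)
B <- A %*% x

# Route 1: solve the system directly
x_solve <- solve(A, B)

# Route 2: compute the inverse first, then multiply
x_inv <- solve(A) %*% B

# Both routes recover x
all.equal(x_solve, x_inv)
all.equal(x_solve, x)
```

In practice solve(A, B) is generally preferred, since it avoids computing the full inverse.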
There are many other packages available that you can install and use to do matrix calculations. For instance, {matrixStats} is a package that includes many functions that apply to rows and columns of a matrix. To use the package, you need to install it first. To do so, use install.packages("matrixStats"). The functions in this package are faster and more memory efficient than using apply(). You can find all the functions in that package in Bengtsson (2025).
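A minimal sketch of how {matrixStats} can replace apply() is shown below. It assumes the package is installed and skips the comparison otherwise. The matrix m_demo is hypothetical. colSds() and colMedians() are two of the functions the package provides.

```r
# Hypothetical 4x2 matrix
m_demo <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4, ncol = 2)

# Reference result with apply()
sds_apply <- apply(m_demo, 2, sd)

# Only run the package version if matrixStats is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  sds_pkg <- matrixStats::colSds(m_demo)
  print(all.equal(sds_pkg, sds_apply))
  print(matrixStats::colMedians(m_demo))   # median per column
}
```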
Using matA and matB,
matA <- matrix(1:16, 4, 4)
matB <- matrix(101:116, 4, 4)
add matA to matB:
matA + matB
     [,1] [,2] [,3] [,4]
[1,] 102 110 118 126
[2,] 104 112 120 128
[3,] 106 114 122 130
[4,] 108 116 124 132
multiply matA with matB element by element:
matA * matB
     [,1] [,2] [,3] [,4]
[1,] 101 525 981 1469
[2,] 204 636 1100 1596
[3,] 309 749 1221 1725
[4,] 416 864 1344 1856
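take the log of the second column of matB:
log(matB[, 2])
[1] 4.653960 4.663439 4.672829 4.682131
calculate the cumulative probability of the values in the first column of matA if these values follow a normal distribution with mean 2 and standard deviation 1.5. What do you expect for the value matA[2, 1] = 2?
pnorm(matA[, 1], 2, 1.5)
[1] 0.2524925 0.5000000 0.7475075 0.9087888
calculate the mean and the sum of every column of matB:
colMeans(matB)
[1] 102.5 106.5 110.5 114.5
colSums(matB)
[1] 410 426 442 458
calculate the mean and the sum of every row of matA:
rowMeans(matA)
[1] 7 8 9 10
rowSums(matA)
[1] 28 32 36 40
standardize the columns of matB:
scale(matB, center = TRUE, scale = TRUE)
     [,1] [,2] [,3] [,4]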
[1,] -1.1618950 -1.1618950 -1.1618950 -1.1618950
[2,] -0.3872983 -0.3872983 -0.3872983 -0.3872983
[3,] 0.3872983 0.3872983 0.3872983 0.3872983
[4,] 1.1618950 1.1618950 1.1618950 1.1618950
attr(,"scaled:center")
[1] 102.5 106.5 110.5 114.5
attr(,"scaled:scale")
[1] 1.290994 1.290994 1.290994 1.290994
center the columns of matA without scaling them:
scale(matA, center = TRUE, scale = FALSE)
     [,1] [,2] [,3] [,4]
[1,] -1.5 -1.5 -1.5 -1.5
[2,] -0.5 -0.5 -0.5 -0.5
[3,] 0.5 0.5 0.5 0.5
[4,] 1.5 1.5 1.5 1.5
attr(,"scaled:center")
[1] 2.5 6.5 10.5 14.5
Using the apply() function and simplifying your results:
calculate the quantiles of every column of matB:
apply(matB, MARGIN = 2, FUN = quantile, simplify = TRUE)
     [,1] [,2] [,3] [,4]
0% 101.00 105.00 109.00 113.00
25% 101.75 105.75 109.75 113.75
50% 102.50 106.50 110.50 114.50
75% 103.25 107.25 111.25 115.25
100% 104.00 108.00 112.00 116.00
calculate the quantiles of every row of matB:
apply(matB, MARGIN = 1, FUN = quantile, simplify = TRUE)
     [,1] [,2] [,3] [,4]
0% 101 102 103 104
25% 104 105 106 107
50% 107 108 109 110
75% 110 111 112 113
100% 113 114 115 116
find the position of the minimum in every column of matA (what do you expect?):
apply(matA, MARGIN = 2, FUN = which.min, simplify = TRUE)
[1] 1 1 1 1
rescale every row of matA by subtracting the minimum of that row and dividing by the difference between the maximum and minimum for that row:
apply(matA, 1, function(x) (x - min(x))/(max(x) - min(x)), simplify = TRUE)
     [,1] [,2] [,3] [,4]
[1,] 0.0000000 0.0000000 0.0000000 0.0000000
[2,] 0.3333333 0.3333333 0.3333333 0.3333333
[3,] 0.6666667 0.6666667 0.6666667 0.6666667
[4,] 1.0000000 1.0000000 1.0000000 1.0000000
subtract from every column of matB the median value for the column and divide by the standard deviation of the column:
apply(matB, 2, \(x) (x - median(x))/sd(x), simplify = TRUE)
     [,1] [,2] [,3] [,4]
[1,] -1.1618950 -1.1618950 -1.1618950 -1.1618950
[2,] -0.3872983 -0.3872983 -0.3872983 -0.3872983
[3,] 0.3872983 0.3872983 0.3872983 0.3872983
[4,] 1.1618950 1.1618950 1.1618950 1.1618950
test for every column of matB if its mean is different from 101 at the 5% level using Student’s t-distribution with 3 degrees of freedom. Show the results as TRUE if the mean is different and FALSE otherwise. Do so in one line of code within the apply() function:
apply(matB, 2, function(x) (pt(mean(x)-101, df = 3, lower.tail = FALSE)) <= 0.05)
[1] FALSE TRUE TRUE TRUE
Using matC
vec1 <- sample(c(letters, LETTERS), 16)
vec2 <- sample(c(letters, LETTERS), 16)
matC <- matrix(paste(vec1, vec2, sep = "_"), 8, 2)
matC
     [,1] [,2]
[1,] "V_O" "d_i"
[2,] "W_u" "M_R"
[3,] "R_c" "t_L"
[4,] "q_W" "s_M"
[5,] "S_F" "i_P"
[6,] "k_r" "Y_Q"
[7,] "c_t" "f_K"
[8,] "e_G" "H_s"
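count per column the elements that contain an upper case letter, followed by an underscore and a lower case letter:
apply(matC, 2, function(x) sum(grepl(pattern = "[A-Z]_[a-z]", x)), simplify = TRUE)
[1] 2 1
count per row the elements that contain a lower case letter, followed by an underscore and any letter:
apply(matC, 1, function(x) sum(grepl(pattern = "[a-z]_[a-zA-Z]", x)), simplify = TRUE)
[1] 1 0 1 2 1 1 2 1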
Using matD and matE
matD <- matrix(rnorm(5, 5, 10), 5, 1)
matE <- matrix(rnorm(5, 5, 10), 5, 1)
transpose matD:
t(matD)
     [,1] [,2] [,3] [,4] [,5]
[1,] 13.75133 -0.1315757 1.75805 -8.878751 -1.445147
calculate the inner product of matD and matE:
t(matD) %*% matE
     [,1]
[1,] -205.5606
center matD, multiply the transpose of this matrix with itself, divide by the number of rows - 1 and take the square root:
matDscale <- scale(matD, center = TRUE, scale = FALSE)
sqrt((t(matDscale) %*% matDscale)/(nrow(matD) - 1))
     [,1]
[1,] 8.185649
calculate the standard deviation of matD:
sd(matD)
[1] 8.185649
What do you see if you compare the outcomes of the last two calculations? Do you know why that is the case?
Vectors are uni-dimensional. Matrices are two-dimensional. Arrays allow you to store data in more than two dimensions. You can think of arrays as a series of matrices of the same dimensions. Like matrices, they are homogeneous: all values in an array have the same type. In other words, matrices are a special case of arrays: they are arrays with 1 matrix.
To create an array, you can use the array() function. This function needs the data to be stored in an array and the dimensions of the array stored in a vector. Here, you need three: nrow, ncol and nmat. The array() function reads the data and determines the dimensions from c(nrow, ncol, nmat). For instance, to create an array with two 3x3 matrices, you can use
arr <- array(1:18, c(3, 3, 2))
arr
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
The array arr includes 2 matrices. Both are 3x3 matrices. The first includes the values 1 to 9 and the second the values 10 to 18. Note that R fills the array matrix by matrix and, within a matrix, column by column.
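This fill order means that every element of an array also has a single linear position. As a sketch, for a 3x3x2 array the element in row i, column j of matrix k sits at position i + (j - 1) * 3 + (k - 1) * 9. The name arr_demo is hypothetical and is used to avoid overwriting arr.

```r
arr_demo <- array(1:18, c(3, 3, 2))

# Row 2, column 3 of the second matrix
i <- 2; j <- 3; k <- 2

# Linear position: rows vary fastest, then columns, then matrices
pos <- i + (j - 1) * 3 + (k - 1) * 3 * 3   # 2 + 6 + 9 = 17

arr_demo[i, j, k]   # 17
arr_demo[pos]       # 17, the same element via its linear position
```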
To see if arr is an array, you can check its class
class(arr)
[1] "array"
In addition, to see the type of values of an array, you can use
typeof(arr)
[1] "integer"
In this case, R read the values as integers. As an alternative, you can verify if arr is an array using:
is.array(arr)
[1] TRUE
You can check the dimensions of an array using
dim(arr)
[1] 3 3 2
Let’s see what happens if the data in the array() function has fewer values than the number of elements in the array:
arr <- array(1:6, c(3, 3, 2))
arr
, , 1
[,1] [,2] [,3]
[1,] 1 4 1
[2,] 2 5 2
[3,] 3 6 3
, , 2
[,1] [,2] [,3]
[1,] 4 1 4
[2,] 5 2 5
[3,] 6 3 6
As was the case with matrices, R uses some values in the data more than once. Here, the first matrix is filled by column. As the data only includes the values 1 to 6, R uses the first three values of the data, 1 to 3, again to fill the last column of the first matrix. To fill the first column of the second matrix, R continues with the values 4 to 6. To fill the last 2 columns of the second matrix, R uses the data a third time.
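This recycling can be made explicit with rep(). The sketch below builds the same array twice, once by letting array() recycle the data and once by repeating the data to the required length first. The names arr_short and arr_rep are hypothetical.

```r
# array() recycles the 6 values until all 18 elements are filled
arr_short <- array(1:6, c(3, 3, 2))

# The same result with the recycling written out
arr_rep <- array(rep(1:6, length.out = 18), c(3, 3, 2))

identical(arr_short, arr_rep)

# Third column of the first matrix reuses the values 1 to 3
arr_short[, 3, 1]
```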
If the data include more values than there are elements in the array, e.g.
arr <- array(1:24, c(3, 3, 2))
arr
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
R uses the values it needs to fill the array and leaves all others out. Here, there are 24 values to store in an 18 element array. R uses only the first 18.
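Because array() recycles or drops values without raising an error, a mismatch between the data and the dimensions can go unnoticed. As a minimal sketch, the hypothetical helper strict_array() below (a name chosen for this illustration) stops when the number of values does not match the number of elements.

```r
# Hypothetical helper: build an array only when the data fit exactly
strict_array <- function(data, dims) {
  if (length(data) != prod(dims)) {
    stop("data has ", length(data), " values, but the array needs ", prod(dims))
  }
  array(data, dims)
}

ok <- strict_array(1:18, c(3, 3, 2))
dim(ok)

# Too few (or too many) values now raise an error instead of being
# recycled or dropped; try() keeps the script running
short <- try(strict_array(1:6, c(3, 3, 2)), silent = TRUE)
inherits(short, "try-error")
```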
Suppose you have two matrices, matc1 and matc2:
matc1 <- matrix(1:9, 3, 3)
matc2 <- matrix(11:19, 3, 3)
You can collect these into an array with the help of cbind(). This function creates a new matrix by adding the columns of matc2 to those of matc1. When the data is a matrix, R fills the array with the elements in all rows of the first column, then moves to all rows of the second column, and so on.
cbind(matc1, matc2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 4 7 11 14 17
[2,] 2 5 8 12 15 18
[3,] 3 6 9 13 16 19
In other words, to fill an array from 2 3x3 matrices, you can use:
arrc <- array(cbind(matc1, matc2), c(3, 3, 2))
arrc
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 11 14 17
[2,] 12 15 18
[3,] 13 16 19
You can add names to the rows, columns and matrices in an array. You can do so in a number of ways. First, you can include the names in a list in the array() function. This list first shows the row names, then the column names, followed by the matrix names:
arr <- array(1:18, c(3, 3, 2),
dimnames = list(c("year_1", "year_2", "year_3"),
c("var_1", "var_2", "var_3"),
c("firm_1", "firm_2")))
arr
, , firm_1
var_1 var_2 var_3
year_1 1 4 7
year_2 2 5 8
year_3 3 6 9
, , firm_2
var_1 var_2 var_3
year_1 10 13 16
year_2 11 14 17
year_3 12 15 18
To verify the names of an array, you can use dimnames(arr):
dimnames(arr)
[[1]]
[1] "year_1" "year_2" "year_3"
[[2]]
[1] "var_1" "var_2" "var_3"
[[3]]
[1] "firm_1" "firm_2"
Here, R returns a list. To extract the names for the rows, columns or matrices, you add [1], [2] or [3]. Doing so, you extract a list with one element that holds these names. For instance, for the rows:
dimnames(arr)[1]
[[1]]
[1] "year_1" "year_2" "year_3"
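The difference between the single and the double subsetting operator matters here. The following sketch, which rebuilds the named array under the separate name arr_demo, compares the two: [1] keeps the list wrapper, while [[1]] returns the character vector itself.

```r
arr_demo <- array(1:18, c(3, 3, 2),
                  dimnames = list(c("year_1", "year_2", "year_3"),
                                  c("var_1", "var_2", "var_3"),
                                  c("firm_1", "firm_2")))

rows_list <- dimnames(arr_demo)[1]    # still a list, with one element
rows_vec  <- dimnames(arr_demo)[[1]]  # the character vector itself

class(rows_list)   # "list"
class(rows_vec)    # "character"
```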
Second, you can add the names to an existing array. To do so for arrc, you can use
dimnames(arrc) <- list(c("year_1", "year_2", "year_3"),
c("var_1", "var_2", "var_3"),
c("firm_1", "firm_2"))
You can verify these names in a similar way:
dimnames(arrc)
[[1]]
[1] "year_1" "year_2" "year_3"
[[2]]
[1] "var_1" "var_2" "var_3"
[[3]]
[1] "firm_1" "firm_2"
The dimensions of an array and the dimnames are attributes of an array. To see this, we can ask R to show the attributes of arrc:
attributes(arrc)
$dim
[1] 3 3 2
$dimnames
$dimnames[[1]]
[1] "year_1" "year_2" "year_3"
$dimnames[[2]]
[1] "var_1" "var_2" "var_3"
$dimnames[[3]]
[1] "firm_1" "firm_2"
R returns a list. To access these values, you can use, e.g.
attributes(arrc)$dim
[1] 3 3 2
attributes(arrc)$dimnames[1]
[[1]]
[1] "year_1" "year_2" "year_3"
To subset an array, you can use an approach which is very similar to the approach for matrices and vectors: subsetting by position, by name or by logical condition.
To illustrate, we will use the following array:
vec1 <- c(111, 211, 311, 121, 221, 321, 131, 231, 331, 112, 212, 312, 122, 222, 322, 132, 232, 332)
rown <- paste("year", 1:3, sep = "_")
coln <- paste("var", 1:3, sep = "_")
matn <- paste("firm", 1:2, sep = "_")
arr <- array(vec1, c(3, 3, 2), dimnames = list(rown, coln, matn))
arr
, , firm_1
var_1 var_2 var_3
year_1 111 121 131
year_2 211 221 231
year_3 311 321 331
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
Each value encodes its position: the first digit is the row number, the second the column number and the third the matrix number. To subset an array using index positions, you include them in [i, j, k]. The first index position refers to the row, the second to the column and the third to the matrix. For instance, to extract the value in the third row of the second column in the first matrix:
arr[3, 2, 1]
[1] 321
Note that R simplifies the result. In other words, [] acts as a simplifying subsetting operator. To preserve the structure, you need to add drop = FALSE. Doing so, R will keep the structure of the data:
arr[3, 2, 1, drop = FALSE]
, , firm_1
var_2
year_3 321
To subset one value from both matrices, you can use [i, j, ]. Here, you leave the third dimension (the matrix) open. R will show the results in a simplified way unless you add drop = FALSE. For instance, the element on the first row and first column of both matrices equals:
arr[1, 1, ]
firm_1 firm_2
111 112
As you can see, R simplifies the result to a vector. Adding drop = FALSE preserves the structure of the data:
arr[1, 1, , drop = FALSE]
, , firm_1
var_1
year_1 111
, , firm_2
var_1
year_1 112
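One way to see what drop = FALSE preserves is to compare the dimensions of both results. This is a small sketch on an unnamed array, arr_demo, created only for this illustration.

```r
arr_demo <- array(1:18, c(3, 3, 2))

simplified <- arr_demo[1, 1, ]                # plain vector
preserved  <- arr_demo[1, 1, , drop = FALSE]  # still a 3-dimensional array

dim(simplified)   # NULL: a vector has no dim attribute
dim(preserved)    # 1 1 2
```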
You can extract the values in all columns of row i in matrix k using [i, , k]. For instance, to subset all values in the first row of the first matrix:
arr[1, , 1]
var_1 var_2 var_3
111 121 131
[, j, k] subsets the values on all rows in column j of matrix k. For instance, to see the values for the second column of the second matrix:
arr[, 2, 2]
year_1 year_2 year_3
122 222 322
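If you leave two positions open, you extract all values along those two dimensions, e.g. a full matrix, one column from all matrices or one row from all matrices:
arr[, , 2, drop = FALSE]
, , firm_2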
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
arr[, 2, , drop = FALSE]
, , firm_1
var_2
year_1 121
year_2 221
year_3 321
, , firm_2
var_2
year_1 122
year_2 222
year_3 322
arr[3, , ]
firm_1 firm_2
var_1 311 312
var_2 321 322
var_3 331 332
There are two ways to subset multiple rows, columns or matrices from an array. The first uses the colon and subsets a range from x to y: x:y. For instance, rows 1 to 2 from column 1 and matrices 1 to 2:
arr[1:2, 1, 1:2]
firm_1 firm_2
year_1 111 112
year_2 211 212
The second collects all rows, columns or matrices you want to subset in a vector using c(). This allows you to subset these rows, columns and matrices individually. For instance, subsetting rows 1 and 3 and columns 1 and 3 from matrices 1 and 2:
arr[c(1, 3), c(1, 3), c(1, 2)]
, , firm_1
var_1 var_3
year_1 111 131
year_3 311 331
, , firm_2
var_1 var_3
year_1 112 132
year_3 312 332
Using negative indices, you can subset all rows/columns/matrices except those with a negative index number. In the previous example, we extracted all values except row and column 2 from all matrices. You can do the same with negative index positions:
arr[-2, -2, ]
, , firm_1
var_1 var_3
year_1 111 131
year_3 311 331
, , firm_2
var_1 var_3
year_1 112 132
year_3 312 332
You can also use the names of the rows, columns and matrices to subset. To do so, you include the names in quotation marks within the subsetting operator. For instance:
arr["year_1", "var_1", "firm_1"]
[1] 111
arr[, "var_1", "firm_1"]
year_1 year_2 year_3
111 211 311
arr["year_3", , "firm_2"]
var_1 var_2 var_3
312 322 332
arr["year_3", "var_2", ]
firm_1 firm_2
321 322
arr["year_3", , ]
firm_1 firm_2
var_1 311 312
var_2 321 322
var_3 331 332
arr[, "var_3", ]
firm_1 firm_2
year_1 131 132
year_2 231 232
year_3 331 332
arr[, , "firm_2"]
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
As you could with matrices, you can subset an array with a logical array. Let’s first create a random logical array:
cond <- array(sample(c(TRUE, FALSE), size = 18, replace = TRUE), c(3, 3, 2))
cond
, , 1
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE FALSE
, , 2
[,1] [,2] [,3]
[1,] FALSE TRUE TRUE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE TRUE
arr[cond]
[1] 111 221 231 122 132 232 332
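As sample() draws at random, the logical array and the extracted values will differ from run to run. If you want reproducible results, one option is to fix the random seed with set.seed() first. The sketch below assumes an arbitrary seed (123) and uses the names cond_a, cond_b and arr_demo purely for illustration.

```r
set.seed(123)  # arbitrary seed, chosen here only for illustration
cond_a <- array(sample(c(TRUE, FALSE), size = 18, replace = TRUE), c(3, 3, 2))

set.seed(123)  # same seed, same draws
cond_b <- array(sample(c(TRUE, FALSE), size = 18, replace = TRUE), c(3, 3, 2))

identical(cond_a, cond_b)

# The number of extracted values equals the number of TRUE entries
arr_demo <- array(1:18, c(3, 3, 2))
length(arr_demo[cond_a]) == sum(cond_a)
```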
You can create these logical conditions in many ways. For instance, if you want to extract all values larger than 200, you can use this condition in the subsetting operator:
arr[arr > 200]
[1] 211 311 221 321 231 331 212 312 222 322 232 332
You can refine this condition. For instance, if you want to extract all values for the rows and matrices where the first column of the first matrix is larger than 200, you can define the following condition:
cond <- arr[, 1, 1] > 200
cond
year_1 year_2 year_3
FALSE TRUE TRUE
As you can see, there are two values in the first column of the first matrix that are larger than 200. These values are in rows 2 and 3. You can now use this condition to extract the values for rows 2 and 3 for all columns and in both matrices:
arr[cond, , ]
, , firm_1
var_1 var_2 var_3
year_2 211 221 231
year_3 311 321 331
, , firm_2
var_1 var_2 var_3
year_2 212 222 232
year_3 312 322 332
Recall that for a matrix, you could use grepl() to subset row or column names. With arrays, you can also use it on the matrix names. For instance, to extract all data (the full matrix) for the matrices whose name contains an underscore followed by a 2 or a 3, you can use the pattern “_[2-3]”. To do so, you need the matrix names. You can extract these names using
dimnames(arr)
[[1]]
[1] "year_1" "year_2" "year_3"
[[2]]
[1] "var_1" "var_2" "var_3"
[[3]]
[1] "firm_1" "firm_2"
The output of this function is a list. To extract the values of the third element in this list, you can use the double subsetting operator [[ ]]: dimnames(arr)[[3]]. We’ll cover that operator in more depth when we discuss lists. Now you have all the information you need to extract the values:
arr[, , grepl(pattern = "_[2-3]", x = dimnames(arr)[[3]])]
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
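It can help to look at the logical vector that grepl() produces before using it to subset. The sketch below uses a made-up vector of matrix names, mat_names, to show that the pattern above matches anywhere in the name, while adding $ restricts the match to the end of the name.

```r
mat_names <- c("firm_1", "firm_2", "firm_23x")  # hypothetical names

anywhere <- grepl(pattern = "_[2-3]", x = mat_names)
at_end   <- grepl(pattern = "_[2-3]$", x = mat_names)

anywhere   # FALSE TRUE TRUE
at_end     # FALSE TRUE FALSE
```

Note that when only one matrix matches, as in the example in the text, R simplifies the result to a matrix unless you add drop = FALSE.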
As you could with matrices, you can change an individual value or a range of values by subsetting that value or range and assigning a different value. For instance, to multiply all values in the second column of the first matrix by 10:
arr[, 2, 1] <- arr[, 2, 1] * 10
arr[, , 1]
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
To add a matrix to an array, you can use the abind() function of the {abind} package. This package is not part of base R, so you may need to install it first with install.packages("abind"). Let’s first define a new array, arr1. We know that we will add it to arr. In other words, we can use the names of the rows and columns in arr to create the names for the rows and columns in the new array arr1. To do so, we use dimnames(arr)[[1]] for the row names and dimnames(arr)[[2]] for the column names. To be consistent with the naming of matrices, I’ll use “firm_3” for the matrix name. Using array():
arr1 <- array(c(113, 213, 313, 123, 223, 323, 133, 233, 333), c(3, 3, 1), dimnames = list(dimnames(arr)[[1]], dimnames(arr)[[2]], c("firm_3")))
arr1
, , firm_3
var_1 var_2 var_3
year_1 113 123 133
year_2 213 223 233
year_3 313 323 333
The abind() function has many options. Here, we will keep all default values and add the matrix as the last matrix in the array. To do so, use:
abind::abind(arr, arr1)
, , firm_1
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
, , firm_3
var_1 var_2 var_3
year_1 113 123 133
year_2 213 223 233
year_3 313 323 333
As you can see, the array has 3 matrices: firm_1, firm_2 and firm_3. Here, I used an array, but you can also add a matrix.
A second way starts from the deconstruction of the array. Recall that c() applied to a matrix turns the matrix into a vector. The same holds for an array. After deconstruction, you can append the values for your new matrix to that vector. Doing so, you have all the elements that you need to rebuild an array. For instance,
arr_new <- array(c(c(arr), c(113, 213, 313, 123, 223, 323, 133, 233, 333)), c(3, 3, 3), dimnames = list(dimnames(arr)[[1]], dimnames(arr)[[2]], c(dimnames(arr)[[3]], "firm_3")))
arr_new
, , firm_1
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
, , firm_3
var_1 var_2 var_3
year_1 113 123 133
year_2 213 223 233
year_3 313 323 333
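The deconstruct-and-rebuild approach can be wrapped in a small function. The sketch below is only one possible way to do this: add_slice() is a hypothetical helper, not an existing R function, and it assumes that the array already has dimnames and that the new matrix has the same number of rows and columns as the matrices in the array.

```r
# Hypothetical helper: append one matrix to a 3-dimensional array
# Assumes `a` has dimnames and `m` matches the first two dimensions of `a`
add_slice <- function(a, m, name) {
  stopifnot(identical(dim(a)[1:2], dim(m)))
  new_dims <- dim(a) + c(0L, 0L, 1L)         # one extra matrix
  new_names <- dimnames(a)
  new_names[[3]] <- c(new_names[[3]], name)  # name for the new matrix
  array(c(a, m), new_dims, dimnames = new_names)
}

a <- array(1:8, c(2, 2, 2),
           dimnames = list(c("r1", "r2"), c("c1", "c2"), c("m1", "m2")))
m <- matrix(9:12, 2, 2)

a3 <- add_slice(a, m, "m3")
dim(a3)
a3[, , "m3"]
```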
To add rows or columns to the matrices, you can first collect the matrices in separate objects:
mat_1 <- arr[, , 1]
mat_2 <- arr[, , 2]
Using cbind() or rbind() you can now add new rows or columns. For instance, let’s add c(411, 4210, 431) to the first matrix and c(412, 422, 432) to the second:
mat_1 <- rbind(mat_1, c(411, 4210, 431))
mat_2 <- rbind(mat_2, c(412, 422, 432))
You can now change the array arr:
arr <- array(cbind(mat_1, mat_2), c(4, 3, 2), dimnames = list(c(dimnames(arr)[[1]], "year_4"), dimnames(arr)[[2]], dimnames(arr)[[3]]))
arr
, , firm_1
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
year_4 411 4210 431
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
year_4 412 422 432
The fourth row is now added to both matrices.
Removing parts of an array can be done using negative indices. However, in that case, you need to make sure that the dimensions of the various matrices stay equal. For instance, to remove the fourth row from all matrices in arr:
arr[-4, , ]
, , firm_1
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
To remove a matrix (e.g. the third) from arr_new:
arr_new[, , -3]
, , firm_1
var_1 var_2 var_3
year_1 111 1210 131
year_2 211 2210 231
year_3 311 3210 331
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
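One point to keep in mind: if only one matrix is left after removing the others, R simplifies the result to a matrix. As with other subsetting, drop = FALSE keeps the array structure. A short sketch on an unnamed 3x3x3 array created for this illustration:

```r
arr_demo <- array(1:27, c(3, 3, 3))

one_left <- arr_demo[, , -c(2, 3)]                # simplified to a matrix
kept     <- arr_demo[, , -c(2, 3), drop = FALSE]  # still an array

dim(one_left)   # 3 3
dim(kept)       # 3 3 1
```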
As most functions are vectorized, most will apply to each element of an array. For instance
log(arr)
, , firm_1
var_1 var_2 var_3
year_1 4.709530 7.098376 4.875197
year_2 5.351858 7.700748 5.442418
year_3 5.739793 8.074026 5.802118
year_4 6.018593 8.345218 6.066108
, , firm_2
var_1 var_2 var_3
year_1 4.718499 4.804021 4.882802
year_2 5.356586 5.402677 5.446737
year_3 5.743003 5.774552 5.805135
year_4 6.021023 6.045005 6.068426
arr^2
, , firm_1
var_1 var_2 var_3
year_1 12321 1464100 17161
year_2 44521 4884100 53361
year_3 96721 10304100 109561
year_4 168921 17724100 185761
, , firm_2
var_1 var_2 var_3
year_1 12544 14884 17424
year_2 44944 49284 53824
year_3 97344 103684 110224
year_4 169744 178084 186624
sqrt(arr)
, , firm_1
var_1 var_2 var_3
year_1 10.53565 34.78505 11.44552
year_2 14.52584 47.01064 15.19868
year_3 17.63519 56.65686 18.19341
year_4 20.27313 64.88451 20.76054
, , firm_2
var_1 var_2 var_3
year_1 10.58301 11.04536 11.48913
year_2 14.56022 14.89966 15.23155
year_3 17.66352 17.94436 18.22087
year_4 20.29778 20.54264 20.78461
exp(arr)
, , firm_1
var_1 var_2 var_3
year_1 1.609487e+48 Inf 7.808671e+56
year_2 4.326490e+91 Inf 2.099062e+100
year_3 1.163011e+135 Inf 5.642525e+143
year_4 3.126310e+178 Inf 1.516777e+187
, , firm_2
var_1 var_2 var_3
year_1 4.375039e+48 9.636666e+52 2.122617e+57
year_2 1.176062e+92 2.590449e+96 5.705843e+100
year_3 3.161392e+135 6.963429e+139 1.533797e+144
year_4 8.498192e+178 1.871851e+183 4.123027e+187
After subsetting the appropriate matrix, you can apply these functions to one or more matrices. If you reassign their values, these matrices will also change in the array:
arr[, , 1] <- log(arr[, , 1])
arr
, , firm_1
var_1 var_2 var_3
year_1 4.709530 7.098376 4.875197
year_2 5.351858 7.700748 5.442418
year_3 5.739793 8.074026 5.802118
year_4 6.018593 8.345218 6.066108
, , firm_2
var_1 var_2 var_3
year_1 112 122 132
year_2 212 222 232
year_3 312 322 332
year_4 412 422 432
You can calculate the column means and column sums (or their equivalents for rows) using colMeans() and colSums(). When we introduced these functions for matrices, we disregarded the dims argument. Here this argument plays a role. dims = 1 shows the means per column and per matrix:
colMeans(arr_new, dims = 1)
firm_1 firm_2 firm_3
var_1 211 212 213
var_2 2210 222 223
var_3 231 232 233
Changing this into dims = 2 calculates the mean of all values per matrix:
colMeans(arr_new, dims = 2)
firm_1 firm_2 firm_3
884 222 223
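To get a feel for what dims does, you can compare the shape of both results and check individual values against a manual mean. The sketch below uses a small unnamed array, arr_demo, created for this illustration.

```r
arr_demo <- array(1:18, c(3, 3, 2))

per_col_mat <- colMeans(arr_demo, dims = 1)  # one mean per column and matrix
per_mat     <- colMeans(arr_demo, dims = 2)  # one mean per matrix

dim(per_col_mat)   # 3 2: columns in the rows, matrices in the columns
length(per_mat)    # 2

# Each result can be checked against a manual mean
all.equal(per_col_mat[2, 1], mean(arr_demo[, 2, 1]))
all.equal(per_mat[2], mean(arr_demo[, , 2]))
```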
Whether you need the first or the second option depends on the data in the matrices. Here, if the matrices refer to firms, the columns to variables such as revenue, profit or market capitalization and the rows to years, an average across all variables per firm doesn’t make sense. However, if your data refer to measurements (e.g. temperature) per hour and location, where each matrix is a day, an average across all measurements per day does make sense: it is the average daily temperature in e.g. a country.
colSums, rowSums and rowMeans work in a similar way.
The apply() function with MARGIN = 2 applies a function FUN to all columns of an array. For instance, the average for each column across all matrices in arr_new can be calculated as
apply(arr_new, 2, mean)
var_1 var_2 var_3
212 885 232
To use the apply() function per matrix, one straightforward approach is to write a for loop. We will discuss loops in more depth in Chapter 13, but the overall setup of a loop is straightforward. The first part is for (i in c(1, 2, 3)). Here, i first takes the first value in c(1, 2, 3), i.e. i will be 1. The second part of the loop includes the statement that R needs to execute, for instance k <- i^2. R will calculate the square of i and assign it to k. When R finishes this code, it moves back to i in c(1, 2, 3) and changes the value of i from 1 to 2. It now executes the code with i = 2. Here, we use the fact that we can determine the number of matrices from dim(arr_new): the third position in that vector shows the number of matrices. This tells us how many iterations the for loop will make. The code R needs to execute is the apply() function. All we need to do is store the results in a separate matrix. With respect to the dimensions: the apply() function will generate a mean for every variable and for every matrix. If we store the means per matrix in a separate row, we need as many columns in the matrix as we have columns in the array and as many rows as there are matrices in the array. We are now in a position to write the loop. First we create the matrix for the results:
nc <- dim(arr_new)[2]
nr <- dim(arr_new)[3]
matrix_mean <- matrix(0, nr, nc)
# add column names and row names
# column names are the names in the array
# row names are the names of the matrices in the array
colnames(matrix_mean) <- dimnames(arr_new)[[2]]
rownames(matrix_mean) <- dimnames(arr_new)[[3]]
We can use this matrix to store the results as we apply the apply() function across all matrices in the array:
for (i in 1:dim(arr_new)[3]) {
  matrix_mean[i, ] <- apply(arr_new[, , i], 2, mean)
}
To see the results for the mean per variable and per matrix, you can check:
matrix_mean
var_1 var_2 var_3
firm_1 211 2210 231
firm_2 212 222 232
firm_3 213 223 233
In a similar way, you can use the apply() function for all other functions, including your own.
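As an aside, the MARGIN argument of apply() also accepts a vector of dimensions, which offers an alternative to the loop. The sketch below is not the approach used in this section; it only illustrates, on a small array arr_demo created for this purpose, that MARGIN = c(3, 2) gives the same means per matrix and per column.

```r
arr_demo <- array(1:18, c(3, 3, 2),
                  dimnames = list(NULL,
                                  c("var_1", "var_2", "var_3"),
                                  c("firm_1", "firm_2")))

# Loop version, following the setup in the text
loop_means <- matrix(0, dim(arr_demo)[3], dim(arr_demo)[2],
                     dimnames = list(dimnames(arr_demo)[[3]],
                                     dimnames(arr_demo)[[2]]))
for (i in seq_len(dim(arr_demo)[3])) {
  loop_means[i, ] <- apply(arr_demo[, , i], 2, mean)
}

# Same numbers without a loop: keep dimension 3 (matrices) in the rows
# and dimension 2 (columns) in the columns
apply_means <- apply(arr_demo, c(3, 2), mean)

all.equal(loop_means, apply_means)
```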
First create a 4x3x2 array (24 values) arr1 filled with c(1:24)
arr1 <- array(1:24, c(4, 3, 2))
arr1
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
Using 2 4x3 matrices, mat1 and mat2, the first including c(1:12) and the second including c(13:24), create an array arr2 with these two matrices.
mat1 <- matrix(1:12, 4, 3)
mat2 <- matrix(13:24, 4, 3)
arr2 <- array(cbind(mat1, mat2), c(4, 3, 2))
arr2
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
Set names for the rows (obs_1, obs_2, …), the columns (var_1, var_2, …) and the matrices (mat_1, mat_2) of arr1.
dimnames(arr1) <- list(c("obs_1", "obs_2", "obs_3", "obs_4"),
c("var_1", "var_2", "var_3"),
c("mat_1", "mat_2"))
Check the attributes of arr1.
attributes(arr1)
$dim
[1] 4 3 2
$dimnames
$dimnames[[1]]
[1] "obs_1" "obs_2" "obs_3" "obs_4"
$dimnames[[2]]
[1] "var_1" "var_2" "var_3"
$dimnames[[3]]
[1] "mat_1" "mat_2"
Using arr2, extract
arr2[2, 2, 2]
[1] 18
arr2[1, , 1]
[1] 1 5 9
arr2[, 3, 2]
[1] 21 22 23 24
arr2[1, 2, ]
[1] 5 17
arr2[, 1:2, 1]
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
arr2[-1, , ]
, , 1
[,1] [,2] [,3]
[1,] 2 6 10
[2,] 3 7 11
[3,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 14 18 22
[2,] 15 19 23
[3,] 16 20 24
Using names, extract the values in arr1
arr1["obs_1", "var_2", ]
mat_1 mat_2
5 17
arr1[, , "mat_2"]
var_1 var_2 var_3
obs_1 13 17 21
obs_2 14 18 22
obs_3 15 19 23
obs_4 16 20 24
Extract all values larger than 15 from arr2
arr2[arr2 > 15]
[1] 16 17 18 19 20 21 22 23 24
Create a 4x3 matrix, mat_3, filled with c(25:36)
mat_3 <- matrix(25:36, 4, 3)
Add this matrix to arr2
arr2 <- abind::abind(arr2, mat_3)
Remove the fourth row of each matrix in arr2.
arr2 <- arr2[-4, , ]
arr2
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
, , 3
[,1] [,2] [,3]
[1,] 25 29 33
[2,] 26 30 34
[3,] 27 31 35
Remove the third matrix from arr2
arr2[, , -3]
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
Calculate the column means for every column in each matrix of arr1.
colMeans(arr1, dims = 1)
mat_1 mat_2
var_1 2.5 14.5
var_2 6.5 18.5
var_3 10.5 22.5
Calculate the column sum for every column in each matrix of arr1.
colSums(arr1, dims = 1)
mat_1 mat_2
var_1 10 58
var_2 26 74
var_3 42 90
Use the apply() function to calculate for each column in arr1 the value (x - min(x))/(max(x) - min(x)). Write your code in such a way that you can apply it to other arrays with different dimensions. You have to write a for loop. This statement includes for (i in ...) {apply(...)}. Store the results in an array res.
# use the dimensions of arr1 so the code also works for other arrays
res <- array(0, dim(arr1))
for (i in 1:dim(arr1)[3]) {
  res[, , i] <- apply(arr1[, , i], 2, function(x) (x - min(x))/(max(x) - min(x)))
}
res
, , 1
[,1] [,2] [,3]
[1,] 0.0000000 0.0000000 0.0000000
[2,] 0.3333333 0.3333333 0.3333333
[3,] 0.6666667 0.6666667 0.6666667
[4,] 1.0000000 1.0000000 1.0000000
, , 2
[,1] [,2] [,3]
[1,] 0.0000000 0.0000000 0.0000000
[2,] 0.3333333 0.3333333 0.3333333
[3,] 0.6666667 0.6666667 0.6666667
[4,] 1.0000000 1.0000000 1.0000000
Lists are widely used in R. In the previous sections we referred to lists a couple of times. For instance, str_extract_all() returns a list by default. Likewise, the apply() function returns a list when its results cannot be simplified or when you set simplify = FALSE. The attributes of a matrix are shown in a list. Here we add more depth. With lists we move from homogeneous data structures to heterogeneous data structures. Heterogeneous data structures can be used to store various types of data.
Like vectors, lists are one-dimensional. Unlike vectors, matrices or arrays, they can be used to store various data types. In a list, you can store vectors, matrices, characters, formulas, plots, arrays or other lists. In other words, every element in a list can have a different type as well as different dimensions. As a result, lists are a very flexible way of storing a wide variety of data in one data structure. They are used, e.g., to store hierarchical data, to organize complete datasets and to store output from formulas or functions. For instance, the attributes() function for arrays returns a complex data structure that includes the dimensions of an array as well as the names of the rows, columns and matrices. The first are numeric, the second are character variables. The dimensions include 3 elements (the number of rows, columns and matrices), while each set of names can be as short as one element or take on any other size.
A non-nested list is a list that doesn’t include any other lists. In other words, the elements of this list are e.g. matrices, vectors or character variables. Suppose you have the following data per student: the name, student number, a logical indicator for exchange students, the program in which the student is enrolled and information on the courses in the student’s individual program, including their name, ects and lecture hours. These data are stored in various data structures:
student <- "Alice Wonderland"
studentnr <- "r00369258"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Data and programming skills", "Strategic management", "Macro-economics and economic policy", "Economic sociology", "Introduction to methods for operational research")
ects <- c(6, 3, 6, 3, 3)
hours <- c(52, 26, 52, 26, 26)
From the previous sections, you should recognize these structures as character values, a logical value, a character vector and numeric vectors.
To create a list, you can use the list() function. This function’s main arguments are the objects to store in the list. These objects could be named, but for now, we’ll add no names. We can add all these structures to a list using:
stud1 <- list(student, studentnr, program, exchange, course, ects, hours)
Let’s first inspect the structure of this list using str():
str(stud1)
List of 7
$ : chr "Alice Wonderland"
$ : chr "r00369258"
$ : chr "Bachelor business adminstration"
$ : logi FALSE
$ : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
$ : num [1:5] 6 3 6 3 3
$ : num [1:5] 52 26 52 26 26
Here, you can see that this list has 7 elements: 4 with type character, 2 with type numeric and 1 with type logical. As you can see, lists can store elements with various types. We can also inspect the list by printing it:
stud1
[[1]]
[1] "Alice Wonderland"
[[2]]
[1] "r00369258"
[[3]]
[1] "Bachelor business adminstration"
[[4]]
[1] FALSE
[[5]]
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
[[6]]
[1] 6 3 6 3 3
[[7]]
[1] 52 26 52 26 26
Here, you see that stud1 has two levels: the first is the level of the 7 elements in that list, the second level consists of the individual values within each of these 7 elements. The highest hierarchy is shown with double square brackets [[ ]]. The second level is shown with single square brackets [ ]. You can verify the class and type of stud1 using
class(stud1)
[1] "list"
typeof(stud1)
[1] "list"
As you can see, both show “list”. You can determine the number of components in a list using the length() function. Here, stud1 has 7 components. To check this, you can use
length(stud1)
[1] 7
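To inspect what each component of a list holds, you can also apply a function to every component. The sketch below uses a shortened, made-up list stud_demo together with the base R functions sapply() and lengths().

```r
# Shortened example list, created only for this illustration
stud_demo <- list("Alice Wonderland", FALSE, c(6, 3, 6, 3, 3))

comp_class  <- sapply(stud_demo, class)  # class of every component
comp_length <- lengths(stud_demo)        # number of values per component

comp_class    # "character" "logical" "numeric"
comp_length   # 1 1 5
```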
A lot of the functions that we saw in the previous sections that return a list, return an unnamed list. For instance, str_extract_all() returns
char <- c("Fair if foul and foul is fair.", "Hover through the fog and filthy air.")
stringr::str_extract_all(char, pattern = "fair|fog|filthy")
[[1]]
[1] "fair"
[[2]]
[1] "fog" "filthy"
They do this because the results of these functions are often not compatible with a matrix or vector. For instance, here, you have two sets of matches: one with 1 element (fair) and one with 2 elements (fog and filthy both appear in the second element of the character vector). To store these results, you need a list.
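The same behaviour can be seen with base R only. As a sketch, the combination of gregexpr() and regmatches() below returns the matches for the same character vector as a list, with a different number of values per element.

```r
char <- c("Fair if foul and foul is fair.", "Hover through the fog and filthy air.")

# gregexpr() locates all matches, regmatches() extracts them as a list
matches <- regmatches(char, gregexpr("fair|fog|filthy", char))

is.list(matches)   # TRUE
lengths(matches)   # 1 2: one match in the first string, two in the second
```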
You can add a name to the elements of a list by adding them in the list() function. For instance:
stud1 <- list(name = student,
number = studentnr,
program = program,
exchange = exchange,
course = course,
hours = hours,
ects = ects)
If you check the structure of the list, you can now see the names of that list:
str(stud1)
List of 7
$ name : chr "Alice Wonderland"
$ number : chr "r00369258"
$ program : chr "Bachelor business adminstration"
$ exchange: logi FALSE
$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
$ hours : num [1:5] 52 26 52 26 26
$ ects : num [1:5] 6 3 6 3 3
Printing the list also reveals their names:
stud1
$name
[1] "Alice Wonderland"
$number
[1] "r00369258"
$program
[1] "Bachelor business adminstration"
$exchange
[1] FALSE
$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$hours
[1] 52 26 52 26 26
$ects
[1] 6 3 6 3 3
To extract the names in the list, you can use names().
names(stud1)
[1] "name" "number" "program" "exchange" "course" "hours" "ects"
Some functions in R return a named list. For instance, the attributes() function shows the attributes of a vector or a matrix as a named list:
attributes(matrix(c(10, 20, 30, 40), 2, 2, dimnames = list(c("obs1", "obs2"), c("var1", "var2"))))
$dim
[1] 2 2
$dimnames
$dimnames[[1]]
[1] "obs1" "obs2"
$dimnames[[2]]
[1] "var1" "var2"
Again, note that attributes() returns a list: it wouldn’t be possible to show this result otherwise, as it mixes character and numeric values.
Inside a list, you can have lists. In that case, lists are nested. Let’s add two new students and store their data in lists stud2 and stud3:
student <- "Bart Vader"
studentnr <- "r00362958"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Data and programming skills", "Strategic management", "Macro-economics and economic policy", "Financial statement analysis", "Entrepreneurship and business planning")
ects <- c(6, 3, 6, 6, 3)
hours <- c(52, 26, 52, 52, 26)
stud2 <- list(name = student,
number = studentnr,
program = program,
exchange = exchange,
course = course,
hours = hours,
ects = ects)
student <- "Clark Kent"
studentnr <- "r00362478"
program <- "Bachelor business adminstration"
exchange <- TRUE
course <- c("Macro-economics and economic policy", "Economic sociology", "Entrepreneurship and business planning", "Financial accouing B", "Mathematics for business B")
ects <- c(6, 3, 3, 3, 3)
hours <- c(52, 26, 26, 26, 26)
stud3 <- list(name = student,
number = studentnr,
program = program,
exchange = exchange,
course = course,
hours = hours,
ects = ects)
Using list() we can add these three students to one list and give each list a name:
allstud <- list(student1 = stud1,
student2 = stud2,
student3 = stud3)
Note that the three lists here include the same components. However, this is not necessary. A nested list can include lists with various components.
From the structure of the list
str(allstud)
List of 3
$ student1:List of 7
..$ name : chr "Alice Wonderland"
..$ number : chr "r00369258"
..$ program : chr "Bachelor business adminstration"
..$ exchange: logi FALSE
..$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
..$ hours : num [1:5] 52 26 52 26 26
..$ ects : num [1:5] 6 3 6 3 3
$ student2:List of 7
..$ name : chr "Bart Vader"
..$ number : chr "r00362958"
..$ program : chr "Bachelor business adminstration"
..$ exchange: logi FALSE
..$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Financial statement analysis" ...
..$ hours : num [1:5] 52 26 52 52 26
..$ ects : num [1:5] 6 3 6 6 3
$ student3:List of 7
..$ name : chr "Clark Kent"
..$ number : chr "r00362478"
..$ program : chr "Bachelor business adminstration"
..$ exchange: logi TRUE
..$ course : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
..$ hours : num [1:5] 52 26 26 26 26
..$ ects : num [1:5] 6 3 3 3 3
you can now see that this list has 3 levels: the first includes the three lists, one for every student. The second level shows the list per student and the third level includes the individual values for each list component. These last two levels coincide with the components of stud1, stud2 and stud3. You could add more lists. For instance, you could define a list with course data including the course, the hours and the ects vectors and store these vectors in a separate list. In that case, you would add a level to the hierarchy.
Here, the function names() returns the names of the highest hierarchy:
names(allstud)
[1] "student1" "student2" "student3"
and length() shows the number of components in the highest hierarchy:
length(allstud)
[1] 3
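Functions such as sapply() work on the highest hierarchy of a nested list, so each call receives one complete inner list. The sketch below uses a reduced, made-up version of the student data, all_demo, to sum the ects per student.

```r
# Reduced nested list, created only for this illustration
all_demo <- list(student1 = list(name = "Alice Wonderland", ects = c(6, 3, 6, 3, 3)),
                 student2 = list(name = "Bart Vader", ects = c(6, 3, 6, 6, 3)))

lengths(all_demo)   # number of components per student

# Each s is one inner list; $ picks the ects component
total_ects <- sapply(all_demo, function(s) sum(s$ects))
total_ects          # student1: 21, student2: 24
```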
A special case of lists are plots. Recall the plots with the random draws from various distributions, e.g.
hist(v_norm <- rnorm(n = 100, mean = 0, sd = 1),
probability = TRUE,
col = "lightblue",
border = "white",
xlab = "Value",
main = "Normal")
You can assign this plot to an object, plot_norm:
plot_norm <- hist(v_norm <- rnorm(n = 100, mean = 0, sd = 1),
probability = TRUE,
col = "lightblue",
border = "white",
xlab = "Value",
main = "Normal")
Now check the type of this plot:
typeof(plot_norm)
[1] "list"
As you can see, this plot is stored as a list. In other words, if you store plots in a list, you are using nested lists.
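You can look inside this list like any other list. In the sketch below, hist() is called with plot = FALSE so that only the list is returned and nothing is drawn; the seed is arbitrary and h is a name chosen for this illustration.

```r
set.seed(1)  # arbitrary seed so the draws can be reproduced
h <- hist(rnorm(n = 100, mean = 0, sd = 1), plot = FALSE)

class(h)        # "histogram"
names(h)        # includes "breaks", "counts", "density" and "mids"
sum(h$counts)   # 100: every draw falls in exactly one bin
```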
The function unlist(x, recursive = TRUE, use.names = TRUE) simplifies the list structure and returns all the individual components of the list. With its default value TRUE, the option recursive applies this function to all components of the list: with nested lists, this default unlists all lists within the list. The last option, use.names, which is also TRUE by default, preserves the names. To see what this function does, let’s apply it to stud1. As we don’t have any lists within stud1, the option recursive has no effect. Unlisting stud1 returns:
unlist(stud1)
name
"Alice Wonderland"
number
"r00369258"
program
"Bachelor business adminstration"
exchange
"FALSE"
course1
"Data and programming skills"
course2
"Strategic management"
course3
"Macro-economics and economic policy"
course4
"Economic sociology"
course5
"Introduction to methods for operational research"
hours1
"52"
hours2
"26"
hours3
"52"
hours4
"26"
hours5
"26"
ects1
"6"
ects2
"3"
ects3
"6"
ects4
"3"
ects5
"3"
The output shows all individual components. Because the result is a vector, and a vector is homogeneous, all values are coerced to character. Note that e.g. course, which is a character vector, is simplified to its individual elements. R labels these elements as e.g. course1, course2, … . Likewise, hours, a numeric vector, is shown as individual elements with names hours1, hours2, … .
Applied to the nested list allstud with recursive = FALSE, unlist() removes only the highest level and returns the individual components of the three lists as one long list. The names of the highest level in allstud are used to construct the new names. Using unlist(allstud, recursive = FALSE) returns:
unlist(allstud, recursive = FALSE)$student1.name
[1] "Alice Wonderland"
$student1.number
[1] "r00369258"
$student1.program
[1] "Bachelor business adminstration"
$student1.exchange
[1] FALSE
$student1.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$student1.hours
[1] 52 26 52 26 26
$student1.ects
[1] 6 3 6 3 3
$student2.name
[1] "Bart Vader"
$student2.number
[1] "r00362958"
$student2.program
[1] "Bachelor business adminstration"
$student2.exchange
[1] FALSE
$student2.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$student2.hours
[1] 52 26 52 52 26
$student2.ects
[1] 6 3 6 6 3
$student3.name
[1] "Clark Kent"
$student3.number
[1] "r00362478"
$student3.program
[1] "Bachelor business adminstration"
$student3.exchange
[1] TRUE
$student3.course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
$student3.hours
[1] 52 26 26 26 26
$student3.ects
[1] 6 3 3 3 3
You can see that this is a list if you use is.list():
is.list(unlist(allstud, recursive = FALSE))[1] TRUE
Note that all names include a dot “.” to separate the name of the list (e.g. student1) and the name of the component (e.g. name). Using names() you can select the names:
names(unlist(allstud, recursive = FALSE)) [1] "student1.name" "student1.number" "student1.program"
[4] "student1.exchange" "student1.course" "student1.hours"
[7] "student1.ects" "student2.name" "student2.number"
[10] "student2.program" "student2.exchange" "student2.course"
[13] "student2.hours" "student2.ects" "student3.name"
[16] "student3.number" "student3.program" "student3.exchange"
[19] "student3.course" "student3.hours" "student3.ects"
In case of nested lists, the default recursive = TRUE will simplify every list in the nested list. In other words, the output will be similar to the one for unlisting unnested lists.
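The difference between the two settings can be sketched on a small hypothetical nested list (demo and its contents are invented for this illustration):

```r
# Hypothetical nested list with two sub-lists
demo <- list(
  s1 = list(name = "Alice", ects = c(6, 3)),
  s2 = list(name = "Bart",  ects = c(6, 6))
)

flat_one <- unlist(demo, recursive = FALSE)  # removes only the highest level
flat_all <- unlist(demo)                     # recursive = TRUE by default

is.list(flat_one)       # TRUE: still a list, ects is kept as a vector
names(flat_one)         # "s1.name" "s1.ects" "s2.name" "s2.ects"
is.character(flat_all)  # TRUE: everything is coerced to one character vector
length(flat_all)        # 6 individual values
```

With recursive = FALSE the numeric vectors survive as vectors. With the default, they are split into single values and coerced to character.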
To subset a list, you can use index positions with both the [] subsetting operator and the double square brackets operator [[]]. Let’s start with the first, [], and extract the first element of stud1, the list with the data on the first student Alice Wonderland:
stud1[1]$name
[1] "Alice Wonderland"
As you can see, this operator returns the first component of stud1 and does so as a list. In other words, [] preserves the structure of the data. You can see this from the output (which refers to the $name) as well as from the class of the output:
class(stud1[1])[1] "list"
The double square brackets [[]] are a simplifying operator. They simplify the result as much as possible e.g. to a numeric vector, a character vector, a logical value … . For instance, let’s use [[]] to extract the first element of stud1:
stud1[[1]][1] "Alice Wonderland"
Recall that the preserving subsetting operator returned a list. Here, R simplifies to a character variable.
class(stud1[[1]])[1] "character"
Let’s now subset the sixth element of stud1, the hours for each course. Using the single square brackets, R returns a list:
stud1[6]$hours
[1] 52 26 52 26 26
class(stud1[6])[1] "list"
while the simplifying operator returns a numeric vector:
stud1[[6]][1] 52 26 52 26 26
is.vector(stud1[[6]])[1] TRUE
class(stud1[[6]])[1] "numeric"
To subset this vector, you start from the simplifying operator. As this operator creates a vector, you can now use the subsetting rules for a vector. Here, the vector you subset is stud1[[6]]. To subset the first element, you add [1]:
stud1[[6]][1][1] 52
You can now use all subsetting rules for vectors, e.g.
stud1[[6]][1:4][1] 52 26 52 26
stud1[[6]][-1][1] 26 52 26 26
stud1[[6]][stud1[[6]] > 30][1] 52 52
If the list is named, you can also use the names and add them between quotation marks in the preserving subsetting operator [] or the simplifying operator [[]]. The first returns a list, the second simplifies the output. To extract the name of the student in stud2 and return a list, you can use:
stud2["name"]$name
[1] "Bart Vader"
Simplifying this result can be done using the simplifying subsetting operator [[]]:
stud2[["name"]][1] "Bart Vader"
You can extract the value of a list and simplify the result also in a second way: you add the name of the component after the name of the list separated by the $ subsetting operator: name_of_list$name_of_element. Doing so, R simplifies the results. For instance, to subset the component ects from the list stud2, you can use:
stud2$ects[1] 6 3 6 6 3
Here, the output is simplified to a vector. In other words, stud2$ects returns the same output as stud2[["ects"]]. You can now use all subsetting methods for a vector.
stud2$ects[3][1] 6
Subsetting within a component of a list is determined by the class of that component. In the examples, R simplified to a numeric vector. If one of the components of the list were a matrix, you would use the subsetting rules for a matrix.
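The three operators can be compared side by side in a short sketch. The list stud and its grades matrix are hypothetical and only serve this illustration:

```r
# Hypothetical list with a character, a numeric vector and a matrix component
stud <- list(
  name   = "Alice",
  hours  = c(52, 26, 52),
  grades = matrix(1:4, nrow = 2)  # filled column by column
)

class(stud["hours"])                    # "list": [] preserves the structure
class(stud[["hours"]])                  # "numeric": [[]] simplifies
identical(stud[["hours"]], stud$hours)  # TRUE: $ simplifies as well

stud$hours[stud$hours > 30]  # vector rules apply: 52 52
stud$grades[1, 2]            # matrix rules apply: row 1, column 2 is 3
```

Once the component is extracted with [[]] or $, the subsetting rules of that component's own class take over.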
As was the case with vectors, matrices or arrays, a negative index position extracts all but the element in that position. For instance, extracting all elements of stud2 except the first can be done using:
stud2[-1]$number
[1] "r00362958"
$program
[1] "Bachelor business adminstration"
$exchange
[1] FALSE
$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$hours
[1] 52 26 52 52 26
$ects
[1] 6 3 6 6 3
To extract multiple values, you combine them via c(). For instance, to extract the first and third element of the list stud3, you add these to the preserving operator []
stud3[c(1, 3)]$name
[1] "Clark Kent"
$program
[1] "Bachelor business adminstration"
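Note that in this case the simplifying operator [[]] doesn’t work: it extracts a single component, and a vector of positions such as c(1, 3) is interpreted as recursive indexing (stud3[[c(1, 3)]] is read as stud3[[1]][[3]]) rather than as a request for several components. With named elements, you can also include the names of these elements: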
stud3[c("name", "number")]$name
[1] "Clark Kent"
$number
[1] "r00362478"
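The behaviour of [[]] with a vector of positions can be checked with a small sketch. The list stud below is hypothetical; tryCatch() is used only to capture the error instead of stopping the script:

```r
# Hypothetical list with two components
stud <- list(name = "Clark", courses = c("Macro", "Sociology", "Maths"))

stud[c(1, 2)]    # [] with a vector: a list with both components
stud[[c(2, 3)]]  # [[]] with a vector: same as stud[[2]][[3]], i.e. "Maths"

# stud[[c(1, 3)]] means stud[[1]][[3]]; "Clark" has no third element
res <- tryCatch(stud[[c(1, 3)]], error = function(e) "error")
res              # "error"
```

To extract several components at once, the preserving operator [] is therefore the one to use.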
Using negative index positions, you can extract all but the elements in those positions. For instance, extracting all elements from stud3 except the first and third:
stud3[c(-1, -3)]$number
[1] "r00362478"
$exchange
[1] TRUE
$course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
$hours
[1] 52 26 26 26 26
$ects
[1] 6 3 3 3 3
You can also use logical values to subset a list. For instance:
stud1[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)]$name
[1] "Alice Wonderland"
$exchange
[1] FALSE
$ects
[1] 6 3 6 3 3
This allows you to extract e.g. components of a list using patterns in a name. For instance, extracting a component that includes the pattern “ects” can be done using grepl() where this function searches in the vector names(stud1) for a match with the pattern “ects”:
stud1[grepl(pattern = "ects", names(stud1))]$ects
[1] 6 3 6 3 3
Recall that nested lists are lists that include other lists as their elements. How do you subset a list with lists? Let’s first use index positions. Using [] returns a list. For instance,
allstud[1]$student1
$student1$name
[1] "Alice Wonderland"
$student1$number
[1] "r00369258"
$student1$program
[1] "Bachelor business adminstration"
$student1$exchange
[1] FALSE
$student1$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$student1$hours
[1] 52 26 52 26 26
$student1$ects
[1] 6 3 6 3 3
returns the first list, stud1 but the output keeps all references to e.g. the name of stud1 within the list allstud. Simplifying using the [[]] operator removes part of the structure of stud1, e.g. the reference to $student1 but the results are still a list.
allstud[[1]]$name
[1] "Alice Wonderland"
$number
[1] "r00369258"
$program
[1] "Bachelor business adminstration"
$exchange
[1] FALSE
$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$hours
[1] 52 26 52 26 26
$ects
[1] 6 3 6 3 3
Note that this shouldn’t be surprising, as stud1 is a list and [[]] returns the most simplified version of this component, which is in this case a list nested in another list. As an alternative to the index position, you can also refer to the name of the list you want to extract. Adding that name to the preserving subsetting operator will extract the list while preserving the structure of the list. For instance, extracting the second list:
allstud["student2"]$student2
$student2$name
[1] "Bart Vader"
$student2$number
[1] "r00362958"
$student2$program
[1] "Bachelor business adminstration"
$student2$exchange
[1] FALSE
$student2$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$student2$hours
[1] 52 26 52 52 26
$student2$ects
[1] 6 3 6 6 3
Doing so with the simplifying operator returns the original list:
allstud[["student2"]]$name
[1] "Bart Vader"
$number
[1] "r00362958"
$program
[1] "Bachelor business adminstration"
$exchange
[1] FALSE
$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$hours
[1] 52 26 52 52 26
$ects
[1] 6 3 6 6 3
Let’s now move one step lower in the hierarchy. If you want to extract e.g. the name of student1, you first extract the first list using the simplifying operator. Doing so, you extract the list stud1. Adding [1] extracts the first index position of the list stud1
allstud[[1]][1]$name
[1] "Alice Wonderland"
while using [[1]] simplifies the output
allstud[[1]][[1]][1] "Alice Wonderland"
A second way uses the names of the elements. For instance, extracting the name of the student in student1 using the preserving operator to return a list:
allstud[["student1"]]["name"]$name
[1] "Alice Wonderland"
or the simplifying operator to return a character variable:
allstud[["student1"]][["name"]][1] "Alice Wonderland"
Third, recall that the $ operator acts as a simplifying operator. In other words, you can extract the first list using allstud$student1. You can now extract the elements of that list using the preserving operator [] or the simplifying operator [[]], both with index positions and names, as well as the $ operator. For instance, to extract the values in ects:
[]:
allstud$student1[7]$ects
[1] 6 3 6 3 3
allstud$student1["ects"]$ects
[1] 6 3 6 3 3
[[]]:
allstud$student1[[7]][1] 6 3 6 3 3
allstud$student1[["ects"]][1] 6 3 6 3 3
$:
allstud$student1$ects[1] 6 3 6 3 3
Note that you can mix index and named subsetting. Recall that, applied to a nested list, the [[]] operator returns the inner list without the reference to its name (e.g. student1). You can then apply the $ operator to that inner list. For instance
allstud[[1]]$ects[1] 6 3 6 3 3
extracts the number of credits for student1.
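The equivalence of these routes can be verified with identical(). The nested list demo below is a hypothetical stand-in with a single student:

```r
# Hypothetical nested list with one student
demo <- list(student1 = list(name = "Alice", ects = c(6, 3, 6)))

by_index <- demo[[1]][[2]]                  # positions only
by_name  <- demo[["student1"]][["ects"]]    # names only
by_dollar <- demo$student1$ects             # $ at both levels
mixed    <- demo[[1]]$ects                  # position first, then name

identical(by_index, by_name)    # TRUE
identical(by_name, by_dollar)   # TRUE
identical(by_dollar, mixed)     # TRUE
demo$student1$ects[2]           # vector subsetting on the result: 3
```

Which route to prefer is largely a matter of readability. Names tend to be more robust than positions when components are added or removed later.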
allstud includes data for all students, where each student’s data is stored in a separate list. In the previous section, we subsetted data for an individual student. But what if we need similar data for each student in the list? To do that, you can use the Filter() function, or use unlist() to remove the highest list level and extract the information from the lists at the second level.
Using the Filter(f, x) function (note the uppercase F), you can filter nested lists. Its arguments are f, a function that returns a logical value, and x, a vector or list. Filter() applies f to every element of x and keeps the elements for which f returns TRUE. Here, x is the nested list allstud, so f is applied to each student’s list in turn. Each of these lists holds vectors such as ects, hours or course. We can use these to extract all students that meet a condition. This condition is defined by f. For instance, suppose that we want to extract all students whose courses add up to more than 21 ECTS. To calculate the total number of ECTS, we use sum(x$ects). Inside f, x refers to one student’s list, i.e. x$ects is shorthand for allstud$studenti$ects. The condition can be written as sum(x$ects) > 21. This gives us the function f: function(x) sum(x$ects) > 21. Using this in Filter():
Filter(function(x) sum(x$ects) > 21, allstud)$student2
$student2$name
[1] "Bart Vader"
$student2$number
[1] "r00362958"
$student2$program
[1] "Bachelor business adminstration"
$student2$exchange
[1] FALSE
$student2$course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$student2$hours
[1] 52 26 52 52 26
$student2$ects
[1] 6 3 6 6 3
This function returns the list of the second student. This is the only student whose total ECTS exceeds 21. Extracting all exchange students (exchange = TRUE) can be done using:
Filter(function(x) x$exchange == T, allstud)$student3
$student3$name
[1] "Clark Kent"
$student3$number
[1] "r00362478"
$student3$program
[1] "Bachelor business adminstration"
$student3$exchange
[1] TRUE
$student3$course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
$student3$hours
[1] 52 26 26 26 26
$student3$ects
[1] 6 3 3 3 3
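To see what Filter() does internally, the sketch below reproduces it with sapply() and logical subsetting. The list demo is a hypothetical, shortened version of the student data, and isTRUE() is used here as an assumed, slightly more defensive alternative to == T:

```r
# Hypothetical, shortened student data
demo <- list(
  student1 = list(name = "Alice", exchange = FALSE, ects = c(6, 3, 6, 3, 3)),
  student2 = list(name = "Bart",  exchange = FALSE, ects = c(6, 3, 6, 6, 3)),
  student3 = list(name = "Clark", exchange = TRUE,  ects = c(6, 3, 3, 3, 3))
)

# One logical value per student: totals are 21, 24 and 18 ECTS
keep <- sapply(demo, function(x) sum(x$ects) > 21)
keep  # FALSE TRUE FALSE

# Logical subsetting gives the same result as Filter()
identical(demo[keep], Filter(function(x) sum(x$ects) > 21, demo))  # TRUE

# Names of the exchange students
names(Filter(function(x) isTRUE(x$exchange), demo))  # "student3"
```

Writing the condition as a separate logical vector can be handy when you want to inspect which students pass before subsetting.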
You can also first use unlist() to remove the highest level of the nested list. Recall that unlist() with recursive = FALSE removes the upper hierarchy of a nested list and that you can collect the names of the components in the remaining list. Using these names, you can now extract components. To see how, let’s first store the output of unlist() in a separate list:
unl_allstud <- unlist(allstud, recursive = FALSE)
and extract the names:
unl_allstud_names <- names(unl_allstud)
Let’s now try to extract all courses for every student. This is where regular expressions enter. The courses are stored in e.g. student1.course or student2.course, i.e. a pattern “student”“digit”“.”“course”. In terms of a regular expression, this is the pattern "student\\d.course". Note that an unescaped . matches any single character, including the literal dot in these names; writing \\. would match a literal dot only. Recall that grepl() returns the logical value TRUE if a pattern is matched. In other words, grepl(pattern = "student\\d.course", unl_allstud_names) returns TRUE for every element of the names vector that matches, such as student1.course or student3.course. We can now use this logical vector to subset unl_allstud:
unl_allstud[grepl(pattern = "student\\d.course", unl_allstud_names)]$student1.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$student2.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$student3.course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
As you can see, we now have a list which includes the courses for each student. If you assign this result to a list, e.g. courses, you can subset these courses and find students who, e.g., took Economic sociology.
You can write this code shorter:
unlist(allstud, recursive = FALSE)[grepl(pattern = "student\\d.course", names(unlist(allstud, recursive = FALSE)))]$student1.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Economic sociology"
[5] "Introduction to methods for operational research"
$student2.course
[1] "Data and programming skills"
[2] "Strategic management"
[3] "Macro-economics and economic policy"
[4] "Financial statement analysis"
[5] "Entrepreneurship and business planning"
$student3.course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
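The follow-up step, finding who took a given course, could look as follows. This is a sketch on hypothetical data (demo and the course names are invented), and it uses a stricter variant of the pattern with anchors and an escaped dot, which is an assumption of this example rather than the pattern used above:

```r
# Hypothetical nested list with two students
demo <- list(
  student1 = list(name = "Alice", course = c("Macro", "Sociology")),
  student2 = list(name = "Bart",  course = c("Macro", "Finance"))
)

flat <- unlist(demo, recursive = FALSE)

# ^ and $ anchor the match, \\d+ allows several digits, \\. is a literal dot
hits    <- grepl("^student\\d+\\.course$", names(flat))
courses <- flat[hits]
names(courses)  # "student1.course" "student2.course"

# For each student, check whether "Sociology" is among the courses
took <- sapply(courses, function(x) "Sociology" %in% x)
names(courses)[took]  # "student1.course"
```

The stricter pattern matters mostly when other component names could accidentally match, e.g. a component called student1_courses.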
There are three ways to change the elements in a list: first you change one of a list’s components. Second, you can add a new component and third, you can remove a component.
Changing one of the components of a non nested list is not different from changing one of the elements of a vector or matrix. Subsetting this component and reassigning its value will do just that. For instance, changing the value FALSE to TRUE in the exchange component of stud1:
stud1[4] <- TRUE
As an alternative, you can also use the other subsetting operators, [[]] or $. For instance
stud1$exchange <- FALSE
changes this value back to FALSE.
To change a value within a component that is a vector, matrix or array, you use a similar approach. For instance, changing the first element in the hours vector for student 1 from 52 to 26 uses the fact that stud1$hours is a vector. Changing the first element of this vector:
stud1$hours[1] <- 26
Note that here, you can use any approach we have covered for the other data structures. In other words, you can increase the number of elements in a vector (e.g. by adding them via append() or via c()), add columns and rows to a matrix using rbind() or cbind(), or change the number of matrices in an array.
Suppose that you would like to add the total number of hours to the lists stud1, stud2 and stud3. The first approach adds a component by assigning its value to stud1[8]. Recall that stud1 includes 7 components, so a new component becomes the eighth. You can write this more generally using length(stud1), which returns the number of components in stud1: assigning to position length(stud1) + 1 creates a new component. This procedure is safer than a hard-coded number such as 8. Especially if you have long and complex lists, you could easily overwrite an existing component. To add the total hours, we use the fact that stud1$hours is a vector, so sum(stud1$hours) returns the total number of hours:
stud1[length(stud1) + 1] <- sum(stud1$hours)
stud1[8][[1]]
[1] 156
Note that stud1[8] is not named. To fix this, we can add the name total. names(stud1) is a vector, so we can set its eighth element using:
names(stud1)[8] <- "total"
stud1$total[1] 156
To name the component, you could again use length(stud1). However, in this case, note that you want to change the last component and not the last plus one.
The second way creates a named component. To do so, we add the name of that component, total, to stud2 using the $ operator. We can assign the total number of hours to that named component:
stud2$total <- sum(stud2$hours)
stud2$total[1] 208
The third approach uses the c() function. Here, we add the component “total” by combining it with the existing components of stud3 and assigning this new list to stud3. For stud3:
stud3 <- c(stud3, "total" = sum(stud3$hours))
stud3$total[1] 156
The fourth approach uses append(). Here, you include the list as well as the value you want to add in the function arguments: append(list, value). Using this function, you can also add the position using the after = option.
If you want to add a vector, matrix or array as a new component of the list using the first, third or fourth approach, you need to tell R that you want to include that structure as a whole and not its values as individual components. To do so, you wrap the structure in a list() statement. For instance, to add a new vector semester with values c(1, 2):
stud1[length(stud1) + 1] <- list(c(1, 2))
names(stud1)[length(stud1)] <- "semester"
stud3 <- c(stud3, "semester" = list(c(1, 2)))
You can now check that this component was added as a vector:
stud1$semester[1] 1 2
stud3$semester[1] 1 2
Let’s see what would happen if you didn’t include the list() statement. To do so, we’ll use a copy of stud1:
stud1_copy <- stud1
stud1_copy <- c(stud1_copy, "test" = c(200, 300))
stud1_copy$test1[1] 200
stud1_copy$test2[1] 300
rm(stud1_copy)
Here, you can see that R added both values in c(200, 300) as individual elements to components it named test1 and test2. In other words, R didn’t add the vector, it added the values.
Using the second approach to add a new component doesn’t require the list() statement:
stud2$semester <- c(1, 2)
stud2$semester[1] 1 2
Here, you are explicitly telling R that the values c(1, 2) have to be added to one component in the list stud2$semester.
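The approaches can be put next to each other in a compact sketch. The list stud, the component id and its value are hypothetical and only illustrate the mechanics:

```r
# Hypothetical list with two components
stud <- list(name = "Alice", hours = c(52, 26))

# c() with and without list(): one vector component versus separate values
with_list    <- c(stud, semester = list(c(1, 2)))
without_list <- c(stud, semester = c(1, 2))
names(with_list)     # "name" "hours" "semester"
names(without_list)  # "name" "hours" "semester1" "semester2"

# $ adds a named component directly, no list() needed
stud$total <- sum(stud$hours)  # 78

# append() with after = places the new component after position 1
stud <- append(stud, list(id = "r001"), after = 1)
names(stud)          # "name" "id" "hours" "total"
```

Naming the component inside list(), as in list(id = "r001"), is one way to avoid a separate names() step afterwards.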
The first approach to removing components from a list uses negative index numbers. Recall that a negative index subsets all except the negative indices. Using this approach, you assign the value of the subsetted list to the same list name. Doing so will give you a new list with the same name, but without the removed component. For instance, to remove “total” from stud3:
stud3 <- stud3[-8]
stud3$name
[1] "Clark Kent"
$number
[1] "r00362478"
$program
[1] "Bachelor business adminstration"
$exchange
[1] TRUE
$course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Mathematics for business B"
$hours
[1] 52 26 26 26 26
$ects
[1] 6 3 3 3 3
$semester
[1] 1 2
A second way to remove components is to assign them NULL. For instance, to remove the total number of hours for stud1 and stud2:
stud1$total <- NULL
stud2[8] <- NULL
You can verify that both these lists lost their component total:
str(stud1)List of 8
$ name : chr "Alice Wonderland"
$ number : chr "r00369258"
$ program : chr "Bachelor business adminstration"
$ exchange: logi FALSE
$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
$ hours : num [1:5] 26 26 52 26 26
$ ects : num [1:5] 6 3 6 3 3
$ semester: num [1:2] 1 2
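Both removal approaches lead to the same list, as the sketch below shows on a hypothetical list stud. It also adds a name-based variant using setdiff(), which is an assumption of this example and avoids relying on the position of the component:

```r
# Hypothetical list with a component to remove
stud <- list(name = "Alice", hours = c(52, 26), total = 78)

by_index <- stud[-3]                                # negative position
by_name  <- stud[setdiff(names(stud), "total")]     # all names except "total"
stud$total <- NULL                                  # assign NULL

identical(by_index, by_name)  # TRUE
identical(by_name, stud)      # TRUE
names(stud)                   # "name" "hours"
```

Removing by name tends to be safer in longer scripts, because positions shift whenever components are added or removed.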
Let’s add a new student to allstud. The data for this student are collected in a list, stud4. This list will then be added to allstud. The fourth student:
student <- "Lois Lane"
studentnr <- "r00252478"
program <- "Bachelor business adminstration"
exchange <- FALSE
course <- c("Macro-economics and economic policy", "Economic sociology", "Entrepreneurship and business planning", "Financial accouing B", "Economics of the single market")
ects <- c(6, 3, 3, 3, 6)
hours <- c(52, 26, 26, 26, 52)
stud4 <- list(name = student,
number = studentnr,
program = program,
exchange = exchange,
course = course,
hours = hours,
ects = ects)
We can now add this student to allstud. To do so, we use the append() function and add stud4 to allstud using append(allstud, list(stud4)). We add the name using names(allstud)[4] <- "student4". Doing so will add the fourth student to this list:
allstud <- append(allstud, list(stud4))
names(allstud)[4] <- "student4"
allstud$student4$name
[1] "Lois Lane"
$number
[1] "r00252478"
$program
[1] "Bachelor business adminstration"
$exchange
[1] FALSE
$course
[1] "Macro-economics and economic policy"
[2] "Economic sociology"
[3] "Entrepreneurship and business planning"
[4] "Financial accouing B"
[5] "Economics of the single market"
$hours
[1] 52 26 26 26 52
$ects
[1] 6 3 3 3 6
Using the append() function also allows you to specify where the new list enters. Using the argument after = 2 for instance would add student4 after the second position.
As an alternative, you can use c(allstud, list(stud4)). Doing so will add the fourth student to allstud. Here too, you will have to add the name afterwards. If you don’t want to add the name in a separate step, you can use the $ operator with the name of the new list, e.g. adding the data for student 4 once more as allstud$student5:
allstud$student5 <- stud4
Removing lists from a nested list follows a similar approach to the one to remove components from a list: you subset using a negative index and reassign this new list to the name of the old list, or you use NULL to remove the list. For instance, to remove student5 from allstud:
allstud$student5 <- NULL
length(allstud)[1] 4
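The role of the list() wrapper and of the after = argument can be sketched as follows. The lists demo and new are hypothetical, and naming the new sub-list inside list() is a variant assumed here to skip the separate names() step:

```r
# Hypothetical nested list and a new student
demo <- list(student1 = list(name = "Alice"), student2 = list(name = "Bart"))
new  <- list(name = "Lois")

# Wrapped in list(): the new student enters as one sub-list, after position 1
demo2 <- append(demo, list(student4 = new), after = 1)
names(demo2)  # "student1" "student4" "student2"

# Without list(): the components of 'new' are added to the top level instead
wrong <- append(demo, new)
names(wrong)  # "student1" "student2" "name"
```

The second call shows why the wrapper is needed: without it, the nested structure is lost for the added student.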
We already met the apply() function. This function was used to apply functions to the rows or columns of a matrix and allows you to avoid for loops. The lapply() and sapply() functions are designed to apply a function to each component of a list. lapply() returns a list. Hence the name “l”apply: the list version of apply. sapply() simplifies the result to a vector, matrix or array where possible. Hence the name “s”apply: the simplified version of lapply. Like the apply() function, both allow you to avoid explicit loops. Most of what you do within lapply() or sapply() can be done with a loop as well. However, as with apply(), it is often more convenient to use these functions.
To see how these work, let’s start from a simple example: a list with 3 numeric vectors as component:
list1 <- list(vec1 = rnorm(100, 0, 1),
vec2 = rnorm(100, 5, 10),
vec3 = rnorm(100, 10, 20))
Let’s now use the lapply() function to calculate the mean of each of list1’s components. This function has a couple of arguments. The first is the list to which a function is applied. The second is FUN, the function to be applied to each component of the list. Optional arguments of that function, e.g. na.rm = TRUE, can follow.
The function can be a base R function or a function you include in the lapply() or sapply() call. For instance, to calculate the mean of the components of list1:
lapply(list1, mean, na.rm = TRUE)$vec1
[1] 0.0144794
$vec2
[1] 4.745767
$vec3
[1] 9.735662
Here, lapply() returns a list. In addition to the arguments of lapply(), sapply() has the arguments simplify = TRUE and USE.NAMES = TRUE (both TRUE by default). We will keep these default values. To calculate the mean for every component in the list:
sapply(list1, mean, na.rm = TRUE) vec1 vec2 vec3
0.0144794 4.7457673 9.7356624
As you can see, this function returns a vector.
Let’s see what it would take to write the same code with a loop:
result_mean <- matrix(0, 1, 3)
for (i in 1:3) {
result_mean[1, i] <- mean(list1[[i]])
}
colnames(result_mean) <- names(list1)
result_mean vec1 vec2 vec3
[1,] 0.0144794 4.745767 9.735662
Using sapply() you write this for loop in one line of code: sapply(list1, mean).
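What sapply() returns depends on what the function returns for each component. The sketch below uses two small hypothetical lists to show the three cases:

```r
# Hypothetical list with two numeric vectors of equal length
x <- list(a = c(1, 2, 3), b = c(10, 20, 30))

means  <- sapply(x, mean)   # one value per component: named vector
ranges <- sapply(x, range)  # two values per component: 2 x 2 matrix
means                       # a = 2, b = 20
dim(ranges)                 # 2 2, one column per component

# Results of different lengths cannot be simplified: sapply() returns a list
uneven <- sapply(list(a = 1:2, b = 1:3), rev)
is.list(uneven)             # TRUE
```

If a script relies on the result being a vector or matrix, it is worth checking this with is.vector() or is.matrix(), since sapply() falls back to a list without warning.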
As you could with the apply() function, you can define your own functions in both lapply() and sapply(). Recall that we used apply() to rescale values: the difference between an element in a column and the column minimum, divided by the difference between the column maximum and minimum. Using lapply() and assigning these new values to list2:
list2 <- lapply(list1, function(x) (x - min(x))/(max(x) - min(x)))
You can now verify that all values in list2 are rescaled:
lapply(list2, range)$vec1
[1] 0 1
$vec2
[1] 0 1
$vec3
[1] 0 1
Using sapply() and storing the values in mat1:
mat1 <- sapply(list1, function(x) (x - min(x))/(max(x) - min(x)))
You can verify this result (recall mat1 is a matrix):
apply(mat1, 2, range) vec1 vec2 vec3
[1,] 0 0 0
[2,] 1 1 1
Let’s revisit the first line lapply(list1, function(x) (x - min(x))/(max(x) - min(x))). Here you call lapply() to apply a function to every component of the list list1. In this case, the list’s components are vectors. The function to apply is function(x) (x - min(x))/(max(x) - min(x)). R will “loop over” every component of list1 and substitute that component for x in function(x). In other words, it applies that function to list1[[1]], then to list1[[2]] … until it reaches the end of the list. lapply() stores the result for every component in a list. sapply() applies the function in the same way, but simplifies the result, where possible, to a vector or matrix.
Note that you can have an apply() function within an lapply() function. If the components of a list are matrices and you would like to apply a function to every column of every matrix, you can use lapply(list, function(x) apply(x, 2, fun)).
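A minimal sketch of this nesting, using a hypothetical list mats of two small matrices and max as the function applied to every column:

```r
# Hypothetical list of two matrices (filled column by column)
mats <- list(
  a = matrix(1:6, nrow = 2),
  b = matrix(c(10, 20, 30, 40), nrow = 2)
)

# Outer lapply() loops over the matrices, inner apply() over their columns
col_max <- lapply(mats, function(m) apply(m, 2, max))
col_max$a  # 2 4 6
col_max$b  # 20 40
```

The result is a list with one vector of column maxima per matrix, so matrices with different numbers of columns pose no problem.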
list1 included only numeric vectors. In stud1 we had a mixture of data types. Most functions, such as mean() or toupper(), are only defined for a specific type of data. As a list can store many types, it is often convenient to first select the components of a list with the same type. Suppose you want to calculate the totals for all numeric vectors in stud1. First, we need to extract these vectors using a logical subsetting vector. To do so, we will use the sapply() function to identify which components meet a condition, with a function that returns TRUE if the condition is met and FALSE otherwise. To select the numeric values, we can use the is.numeric() function within sapply(). This will return, for every component of the list, TRUE if that component is numeric and FALSE if it is not:
cond <- sapply(stud1, \(x) is.numeric(x))
cond name number program exchange course hours ects semester
FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Here, sapply() checks for every component in stud1 whether this component is numeric. In other words, it tests is.numeric(stud1[[1]]), is.numeric(stud1[[2]]) … until it reaches the last component. For every component, is.numeric() returns TRUE if the component is numeric and FALSE otherwise. sapply() stores these outcomes in a vector or matrix, here the vector cond. In other words, cond is a logical vector whose elements are TRUE if a component of stud1 is numeric and FALSE otherwise. We can now use this logical vector to extract the components of stud1 that include numeric data. To do so, you can use
stud1[cond]$hours
[1] 26 26 52 26 26
$ects
[1] 6 3 6 3 3
$semester
[1] 1 2
We now have the numeric components of stud1. Because we subsetted a list, the output is also a list. We can now use lapply() or sapply() to calculate the totals for all numeric vectors in the list stud1:
sapply(stud1[cond], function(x) sum(x)) hours ects semester
156 21 3
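The two steps, selecting and then summarising, can be written compactly. The list stud below is hypothetical; passing is.numeric and sum directly, without wrapping them in function(x), is an equivalent shorthand:

```r
# Hypothetical list with mixed component types
stud <- list(
  name     = "Alice",
  exchange = FALSE,
  hours    = c(26, 26, 52),
  ects     = c(6, 3, 6)
)

num    <- sapply(stud, is.numeric)  # logical vector: which components are numeric
totals <- sapply(stud[num], sum)    # sum of each numeric component
names(stud)[num]                    # "hours" "ects"
totals                              # hours 104, ects 15
```

Note that the logical component exchange is not selected: is.numeric() returns FALSE for logical values.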
With nested lists, the second level in the hierarchy is a list. Suppose now that you want to add a component to each list in the nested list. To illustrate, we’ll add the total number of hours for each student as an additional component to that list. You can subset the components of the lists on the second level within the lapply() or sapply() functions. For our example: for each student, the hours are stored in allstud$studenti$hours. lapply() applies a function to all studenti lists in allstud. Using this observation, including function(x) sum(x$hours) as a function in lapply(), R will ‘loop’ over each studenti and replace x with studenti. In doing so, R calculates the total hours for each student. lapply() returns a list:
lapply(allstud, function(x) sum(x$hours))$student1
[1] 182
$student2
[1] 208
$student3
[1] 156
$student4
[1] 182
If you want to add these total hours to each of the students in allstud, you can use c() to add the component “totalhours” to each sublist in allstud. Here, I store the result of this procedure in a new list. In the structure of this new list, allstud_1, you’ll see that the component totalhours was added to each student’s list:
allstud_1 <- lapply(allstud, function(x) c(x, "totalhours" = sum(x$hours)))
str(allstud_1)List of 4
$ student1:List of 8
..$ name : chr "Alice Wonderland"
..$ number : chr "r00369258"
..$ program : chr "Bachelor business adminstration"
..$ exchange : logi FALSE
..$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Economic sociology" ...
..$ hours : num [1:5] 52 26 52 26 26
..$ ects : num [1:5] 6 3 6 3 3
..$ totalhours: num 182
$ student2:List of 8
..$ name : chr "Bart Vader"
..$ number : chr "r00362958"
..$ program : chr "Bachelor business adminstration"
..$ exchange : logi FALSE
..$ course : chr [1:5] "Data and programming skills" "Strategic management" "Macro-economics and economic policy" "Financial statement analysis" ...
..$ hours : num [1:5] 52 26 52 52 26
..$ ects : num [1:5] 6 3 6 6 3
..$ totalhours: num 208
$ student3:List of 8
..$ name : chr "Clark Kent"
..$ number : chr "r00362478"
..$ program : chr "Bachelor business adminstration"
..$ exchange : logi TRUE
..$ course : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
..$ hours : num [1:5] 52 26 26 26 26
..$ ects : num [1:5] 6 3 3 3 3
..$ totalhours: num 156
$ student4:List of 8
..$ name : chr "Lois Lane"
..$ number : chr "r00252478"
..$ program : chr "Bachelor business adminstration"
..$ exchange : logi FALSE
..$ course : chr [1:5] "Macro-economics and economic policy" "Economic sociology" "Entrepreneurship and business planning" "Financial accouing B" ...
..$ hours : num [1:5] 52 26 26 26 52
..$ ects : num [1:5] 6 3 3 3 6
..$ totalhours: num 182
rm(allstud_1)
Recall that in case you add a data structure such as a vector, matrix or array, you need to include that structure in a list() statement, e.g. c(x, "semester" = list(c(1, 2))).
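To see why the list() wrapper matters, the following sketch compares both variants on a small hypothetical student list x (the names x, no_wrap and wrapped are illustrative):

```r
# Hypothetical student list, for illustration only
x <- list(name = "Alice", hours = c(52, 26))

# Without list(): c() splices the vector in as two separate components
no_wrap <- c(x, "semester" = c(1, 2))
names(no_wrap)

# With list(): the vector is added as one single component
wrapped <- c(x, "semester" = list(c(1, 2)))
names(wrapped)
```

Without the wrapper, R numbers the spliced elements (semester1, semester2); with it, the vector stays together under the name semester.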
Using sapply() returns the same totals, but as a named vector and not as a list:
sapply(allstud, function(x) sum(x$hours))student1 student2 student3 student4
182 208 156 182
Note that you cannot use sapply() to add a component to a list: sapply() simplifies its result to a vector or matrix whenever it can. In other words, you cannot use it to change a list.
A second way to access the components of the lists in a nested list uses the unlist() function. Recall that we can use unlist( ,recursive = FALSE) to unlist the first level. Doing so, returns the second level as a list. Using this level, you can now use lapply() or sapply(). Let’s extract the numeric vectors from the nested list allstud and calculate their sum. In the first step, we unlist allstud with the option recursive = FALSE and store the results in a list allstud_ul:
allstud_ul <- unlist(allstud, recursive = FALSE)
You can verify that we removed the first level from the allstud list. We can now proceed along the lines of the previous example:
cond <- sapply(allstud_ul, function(x) is.numeric(x))
sapply(allstud_ul[cond], function(x) sum(x))student1.hours student1.ects student2.hours student2.ects student3.hours
182 21 208 24 156
student3.ects student4.hours student4.ects
18 182 21
You can now subset this result using the familiar vector or matrix subsetting operations, e.g.
hours_ects <- sapply(allstud_ul[cond], function(x) sum(x))
hours <- hours_ects[grepl(pattern = "\\.hours$", names(hours_ects))]
hoursstudent1.hours student2.hours student3.hours student4.hours
182 208 156 182
If you want to remove the reference to hours in the names, you can use the familiar character functions, e.g.
names(hours) <- stringr::str_extract_all(names(hours), pattern = "student\\d", simplify = TRUE)
hoursstudent1 student2 student3 student4
182 208 156 182
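The same selection and renaming can be sketched with base R only. The example uses a small hypothetical vector totals_demo; note that an unescaped dot in a regular expression matches any character, so the sketch escapes it and anchors the pattern at the end of the name:

```r
# Hypothetical named vector with the same naming scheme as hours_ects
totals_demo <- c(student1.hours = 182, student1.ects = 21, student2.hours = 208)

# "\\." matches a literal dot; "$" anchors the match at the end of the name
keep <- grepl("\\.hours$", names(totals_demo))
hours_only <- totals_demo[keep]

# sub() removes the matched suffix from every name
names(hours_only) <- sub("\\.hours$", "", names(hours_only))
hours_only
```

Whether to use sub() or a stringr function is a design choice; sub() avoids an extra package when the pattern is this simple.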
The lists that are part of a nested list include data. Sometimes you need to identify patterns that occur in some but not necessarily all sublists. For instance, in the example, the student lists that are components of the nested list allstud include data on the courses they took. Suppose that you need to know which students took a specific course, e.g. “Economic sociology”. Visual inspection shows that there are three students who took “Economic sociology”: student1, student3 and student4. To find these students, you look for a pattern in studenti$course. That pattern is “Economic sociology”. We want to subset the studenti$course component in every student list. To do so within lapply(), we use x$course. The function that we apply to every student’s list subsets x$course using a logical vector that equals TRUE if “Economic sociology” is part of the vector with courses and FALSE otherwise. Here, we can use grepl(). Using this function in lapply() returns a list with one logical vector per student:
lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course))$student1
[1] FALSE FALSE FALSE TRUE FALSE
$student2
[1] FALSE FALSE FALSE FALSE FALSE
$student3
[1] FALSE TRUE FALSE FALSE FALSE
$student4
[1] FALSE TRUE FALSE FALSE FALSE
We can now use that vector to subset the course vector for every student. Recall that we can use a logical vector to subset vectors. Here, we do so using x$course[grepl(pattern = "Economic sociology", x = x$course)]. Note that the first x in x = x$course refers to grepl()’s argument name, not to the list’s components. Adding all these in lapply():
lapply(allstud, function(x) x$course[grepl(pattern = "Economic sociology", x = x$course)])$student1
[1] "Economic sociology"
$student2
character(0)
$student3
[1] "Economic sociology"
$student4
[1] "Economic sociology"
The list includes all students and for student2, the component in that list is an empty character vector. In other words, this student doesn’t have this course in the course vector. Without subsetting x$course the function grepl() would show a list with logical indices.
Note that we can use lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course)) to subset other components in every student’s list. For instance, the length of “ects” or “hours” is equal to the length of the components in the logical vector. In other words, we can also extract the ects or hours included in the program of every student who took Economic sociology. Here, both hours and ects are the same as Economic sociology is the same course across students:
lapply(allstud, function(x) x$ects[grepl(pattern = "Economic sociology", x = x$course)])$student1
[1] 3
$student2
numeric(0)
$student3
[1] 3
$student4
[1] 3
What about components such as “name” or “number”? Their length (1) is different from the length of the subsetting logical vectors. Here we can use the fact that TRUE = 1 and FALSE = 0. We looked for one pattern “Economic sociology”. If this pattern occurs in the “course” vector, lapply(allstud, function(x) grepl(pattern = "Economic sociology", x = x$course)) shows TRUE for that position and FALSE elsewhere. Summing across TRUE and FALSE will result in 1 if the subject is included and 0 if this is not the case:
lapply(allstud, function(x) sum(grepl(pattern = "Economic sociology", x = x$course)))$student1
[1] 1
$student2
[1] 0
$student3
[1] 1
$student4
[1] 1
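We can now use this result to subset, e.g., the name and identify who took Economic sociology and who didn’t. Subsetting with index 1 returns the single element of x$name, while index 0 returns an empty vector: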
lapply(allstud, function(x) x$name[sum(grepl(pattern = "Economic sociology", x = x$course))])$student1
[1] "Alice Wonderland"
$student2
character(0)
$student3
[1] "Clark Kent"
$student4
[1] "Lois Lane"
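Indexing with a count works here because each course appears at most once per student; if a course were listed twice, the sum would be 2 and x$name[2] would return NA. A sketch of an alternative that avoids this, using a small hypothetical nested list studs with the same layout as allstud, tests membership with %in% (an exact match, not a pattern match) and subsets the outer list:

```r
# Hypothetical nested list with the same layout as allstud
studs <- list(
  student1 = list(name = "Alice", course = c("Economic sociology", "Statistics")),
  student2 = list(name = "Bart", course = c("Statistics", "Accounting"))
)

# One TRUE/FALSE per student: is the course part of the course vector?
took <- sapply(studs, function(x) "Economic sociology" %in% x$course)

# Keep only the students for whom took is TRUE and extract their names
names_took <- sapply(studs[took], function(x) x$name)
names_took
```

This version drops students who did not take the course instead of returning character(0) for them, which may or may not be what you want.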
Suppose that you want to know the distribution of the value of a stock market portfolio 30 years from now. You would like to answer questions such as: what is the probability that for every euro you invest today, the (nominal) value of your portfolio will rise to e.g. 10 euros in 30 years’ time? What is the probability that your portfolio will be worth 5 euros in 30 years’ time? Because you cannot predict the future with certainty, you decide to run a simulation to estimate this distribution. Using the simulation, you will generate “a lot of” 30-year periods. Using these results, you try to answer your questions. You assume that stock market returns (i.e. the percentage change in the value of your portfolio) are normally distributed. The parameters of this normal distribution, the mean and the standard deviation, equal the average percentage change and the volatility. For instance, if you assume that the yearly mean is 8% and the yearly volatility is 20%, then you know that the return will be between -12% and +28% in about 68.3% of all years and between -32% and +48% in about 95.4% of all years. However, you are not sure that the mean will be 8%. Some portfolios have a lower expected return. Usually, they also have a lower volatility. On the other hand, some portfolios have a higher expected return. In that case, their volatility is higher. You also want to look at returns per month. Doing so gives you 360 months in a 30-year simulation instead of 30 years. In other words, you will run simulations using the following combinations of expected return and volatility:
You store these values in a matrix, mat_data. This matrix is given:
mat_data <- matrix(c(0.48676, 0.56541, 0.64340, 0.72073, 0.79741, 3.46410, 4.54663, 5.77350, 7.14471, 8.66025)/100, nrow = 5, ncol = 2)
colnames(mat_data) <- c("exp_ret", "vol")
rownames(mat_data) <- paste("sim", 1:5, sep = "_")
mat_data exp_ret vol
sim_1 0.0048676 0.0346410
sim_2 0.0056541 0.0454663
sim_3 0.0064340 0.0577350
sim_4 0.0072073 0.0714471
sim_5 0.0079741 0.0866025
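The monthly figures in mat_data look consistent with annual assumptions converted to a monthly frequency. The sketch below checks this for the third row under two assumptions that are not stated in the text: the annual expected return is compounded, so the monthly return is (1 + annual)^(1/12) - 1, and the annual volatility is divided by the square root of 12. The names annual_ret and annual_vol are illustrative.

```r
# Assumed annual figures (8% return, 20% volatility), taken from the example in the text
annual_ret <- 0.08
annual_vol <- 0.20

# Assumed conversion: compound the return, scale the volatility by sqrt(12)
monthly_ret <- (1 + annual_ret)^(1 / 12) - 1
monthly_vol <- annual_vol / sqrt(12)

round(c(exp_ret = monthly_ret, vol = monthly_vol), 7)
```

Under these assumptions the result should match the sim_3 row up to rounding.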
How do you run this simulation? For every monthly return-volatility combination (i.e. for every row in mat_data), you take 360 random draws from a normal distribution. To see the total value after 360 months, you add 1 to every draw and calculate the cumulative product. To see this, note that every euro invested will be worth
\[ (1 + r_1) \] after one month, \[ (1 + r_1)(1 + r_2) \] after two months, … . In other words,
\[ (1 + r_1)(1 + r_2) \cdots (1 + r_{360}) \] will be the value after 360 months or 30 years.
Here you draw the r’s from the normal distribution. If you then add 1, every value equals \(1 + r_t\) for month \(t\). The cumulative product then shows your total value after 360 months. This gives you one simulation, but you need a “large number” of these simulations to answer your question. So, for every return-volatility combination, you generate this simulation 100 times.
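The link between the product formula and R’s cumulative product can be sketched on a short series. The example uses 12 instead of 360 months, an arbitrary seed and illustrative names (r_demo, path, final); the mean and standard deviation are made-up round numbers:

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# 12 hypothetical monthly returns
r_demo <- rnorm(n = 12, mean = 0.005, sd = 0.035)

# Value of one euro after month 1, 2, ..., 12
path <- cumprod(1 + r_demo)

# Value after the last month only
final <- prod(1 + r_demo)

all.equal(path[12], final)
```

cumprod() keeps the whole path, while prod() only returns the end value; the simulation below only needs the latter.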
To store the results, we will use a list for every return-volatility pair and call it simi, where i refers to the row in mat_data. We will store the expected return and volatility as simi$exp_return and simi$volatility. Because you may need these results for other time periods as well, you store the returns matrix in simi$sim_data. After the simulation, you collect the 100 final values in a vector and add it as simi$exp_value. These results allow you to estimate the quantiles of the value distribution. You will store them as simi$quantiles. In addition, you store the mean in simi$mean and the standard deviation in simi$st_dev. The last thing you want to store is the histogram of the final values in simi$plot_sim. For every expected return-volatility combination, you have a separate list. You store these lists in a list simulations.
Let’s create the lists first
simulations <- list()
Let’s look at the simulation for the first return-volatility pair.
Create a list sim1 and add expected return sim1$exp_return and volatility sim1$volatility to the list. Recall that these values are stored in the first row in mat_data:
sim1 <- list(exp_return = mat_data[[1, 1]],
volatility = mat_data[[1, 2]])
sim1
$exp_return
[1] 0.0048676
$volatility
[1] 0.034641
# Note that there are other ways to do this. For instance, you could have
# created an empty list `sim1 <- list()` and used `sim1$exp_return <- mat_data[1, 1]`, or
# started from the empty list and used `sim1 <- c(sim1, "exp_return" = mat_data[1, 1])`.
Add sim1 to the list simulations with the name sim1:
simulations[["sim1"]] <- sim1
# Note that there are other ways to do this, e.g. `simulations$sim1 <- sim1`.
Let’s automate this for the other lists. Here the code is given. Try to predict what every line in this code does. Note that sim1 was already created. In other words, i can start from 2 and needs to run to 5. Focus on the lines that deal with “lists”.
for (i in 2:5) {
sim_names <- paste0("sim", i)
temp_list <- list(exp_return = mat_data[[i, 1]],
volatility = mat_data[[i, 2]])
simulations[[sim_names]] <- temp_list
}
rm(temp_list)
Use the values in sim1$exp_return and sim1$volatility … sim5$exp_return and sim5$volatility to generate a 360 x 100 matrix with random draws from a normal distribution with mean and standard deviation given by exp_return and volatility, add 1 to every element and add this matrix to sim1 … sim5. Do this so that you can rerun the simulations with another set of parameters for the months and draws. In other words, assign the number of draws, ndraws, and the number of months, nmonths, to separate variables. Use these to determine the dimensions of your matrix. Assign this matrix to simi$sim_data.
First let’s look at an example to generate the matrix. Here, call this matrix mat and use the data stored in sim1 to set the mean and standard deviation:
ndraws <- 100
nmonths <- 360
# we need ndraws per month: a total of ndraws * nmonths random draws
# store in ndraws columns with one row per month
mat <- matrix(rnorm(n = (ndraws * nmonths),
mean = simulations$sim1$exp_return,
sd = simulations$sim1$volatility),
nrow = nmonths,
ncol = ndraws) + 1
Let’s try to automate this process using the lapply() function and add the matrix sim_data to every list within the simulations list. Recall that you need to wrap the matrix in a list() call. Use function(x) c(x, "name" = list( )) in the lapply() function to do so:
simulations <- lapply(simulations, function(x) c(x, "sim_data" = list(matrix(rnorm(ndraws * nmonths, x$exp_return, x$volatility), nmonths, ndraws) + 1)))
Let’s see what the alternative would have been if you had used a for loop. Here, the code is given. Try to see what these steps do with respect to lists in this simulation (how are they subsetted …).
# for (i in 1:5) {
#
# simulations[[i]]$sim_data <- matrix(rnorm(ndraws * nmonths, simulations[[i]]$exp_return, simulations[[i]]$volatility),
# nrow = nmonths,
# ncol = ndraws) + 1
# }
Verify that your results are from the correct normal distribution. To do so, use the sapply() function to create the vectors est_mean and est_volatility as the mean and standard deviation of all elements in the sim_data matrix minus 1 (recall that you added 1, so for this purpose you need to subtract 1):
est_mean <- sapply(simulations, function(x) mean(x$sim_data - 1))
est_volatility <- sapply(simulations, function(x) sd(x$sim_data - 1))
You can now use this matrix to determine the value for every one of these 100 draws after 360 months. Assign this vector to simi$exp_value. Recall that you can use the apply() function to calculate the product of all values in a column of a matrix and that you need to simplify the result of apply(). Use the lapply() function to generate these vectors across the various simulations.
simulations <- lapply(simulations, function(x) c(x, "exp_value" = list(apply(x$sim_data, 2, FUN = prod, simplify = TRUE))))
You now have, for every euro invested today, its value 30 years from now for 5 scenarios in terms of expected return and volatility and for 100 simulations per return-volatility combination. Use these values to calculate summary statistics: quantiles (with probabilities 10%, 25%, 50%, 75% and 90%), mean and standard deviation. Store these in simi$quantiles, simi$mean and simi$st_dev. You will need three lines of code using lapply():
simulations <- lapply(simulations, function(x) c(x, "quantiles" = list(quantile(x$exp_value, probs = c(0.10, 0.25, 0.50, 0.75, 0.90), names = TRUE))))
simulations <- lapply(simulations, function(x) c(x, "mean" = mean(x$exp_value, na.rm = TRUE)))
simulations <- lapply(simulations, function(x) c(x, "st_dev" = sd(x$exp_value, na.rm = TRUE)))
Now you can generate a plot. Here is the code to generate the plot for sim1. Try to read it to see what the code is doing. Use ?hist or ?plot to see what these lines are doing:
plot_sim <- hist(simulations$sim1$exp_value, probability = TRUE)
plot(plot_sim, col = "lightyellow", border = "lightgrey",
xlab = "Expected value",
main = glue::glue("Simulation with expected return {simulations$sim1$exp_return} and volatility {simulations$sim1$volatility}"))
Now, generate this plot and store it in $plot_sim in every list in simulations. You can use lapply() to do so. You can leave the plot() code out and only use the hist() part of the previous code.
simulations <- lapply(simulations, function(x) c(x, "plot_sim" = list(hist(x$exp_value, probability = TRUE))))
You now have all the data for your simulations. Now, let’s take a closer look at some of the results and answer a couple of questions. Store each answer in a matrix or list as indicated in the question.
mean_sim:
mean_sim <- sapply(simulations, function(x) x$mean)
mean_sim sim1 sim2 sim3 sim4 sim5
5.984450 8.213977 11.126300 12.941953 12.151095
low_value and high_value:
low_value <- sapply(simulations, function(x) x$quantiles[1])
high_value <- sapply(simulations, function(x) x$quantiles[5])
low_value sim1.10% sim2.10% sim3.10% sim4.10% sim5.10%
1.9962861 1.6127327 1.5882490 1.2373738 0.6525462
high_valuesim1.90% sim2.90% sim3.90% sim4.90% sim5.90%
11.08629 18.71673 24.83965 33.50628 30.25813
vol_sim:
vol_sim <- sapply(simulations, function(x) x$st_dev)
vol_sim sim1 sim2 sim3 sim4 sim5
4.757061 9.713904 14.209238 18.484862 19.995729
max_run. Do the same for the lowest value and store it in min_run:
max_run <- lapply(simulations, function(x) which.max(x$exp_value))
min_run <- lapply(simulations, function(x) which.min(x$exp_value))
sim5$sim_data column associated with the highest expected value for every simulation. Store in a vector test:
test <- simulations$sim5$sim_data[, max_run$sim5]
test is equal to the maximum of the expected values for sim5:
prod(test) - simulations$sim5$exp_value[[max_run$sim5]] < 10^(-12)
[1] TRUE
\[ (1 + r)^{360} \]
below_ave:
below_ave <- lapply(simulations, function(x) sum(x$exp_value < (1 + x$exp_return)^(nmonths)))
below_ave$sim1
[1] 64
$sim2
[1] 66
$sim3
[1] 64
$sim4
[1] 74
$sim5
[1] 79
diff_mean_bool:
diff_mean_bool <- lapply(simulations, function(x) (x$mean - (1 + x$exp_return)^(nmonths)) < 0)
diff_mean_bool$sim1
[1] FALSE
$sim2
[1] FALSE
$sim3
[1] FALSE
$sim4
[1] TRUE
$sim5
[1] TRUE
diff_mean:
diff_mean <- lapply(simulations, function(x) (x$mean - (1 + x$exp_return)^(nmonths)))
diff_mean$sim1
[1] 0.2408568
$sim2
[1] 0.6018453
$sim3
[1] 1.063751
$sim4
[1] -0.3256147
$sim5
[1] -5.298055
lapply(simulations, function(x)
plot(x$plot_sim, col = "lightyellow", border = "lightgrey",
xlab = "Expected value",
main = glue::glue("Simulation with expected return {x$exp_return} and volatility {x$volatility}")))$sim1
NULL
$sim2
NULL
$sim3
NULL
$sim4
NULL
$sim5
NULL
You can verify your plots in the plots tab in the environment pane. The arrow to the left should allow you to see the 5 plots including a different title.
You can think about data frames as lists where each column has the same length (as in a matrix) but each column can store a different type of data (as in a list). As in a matrix, a data frame usually has a fixed set of rows and columns but as in a list, these columns can store different types of variables. We will also use a special type of data frame: a tibble. Tibbles are essentially data frames, but with some additional characteristics.
To create a data frame, you can use the data.frame() function. The first arguments are the data for the data frame. In addition, you can add row.names = NULL. By default, R doesn’t add row names other than 1, 2, 3, …. Adding a vector (integer or character) with the row names or specifying which column R needs to use for row names changes that default. Two other arguments check the data: check.rows = FALSE determines whether R checks that the rows are consistent in terms of their length and their names; check.names = TRUE checks the names of the variables to see if these are valid variable names and not duplicates. Of the last two arguments, fix.empty.names = TRUE adds an automatically generated name in case a variable name is empty, and stringsAsFactors = FALSE determines whether character variables are converted into factors. Let’s create a data frame, df, whose values include numbers, logical values, characters and dates:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df numbers bools characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
3 3 FALSE c 2025-03-27
4 4 TRUE d 2025-03-28
5 5 TRUE e 2025-03-29
You can verify that this is a data frame using e.g.
is.data.frame(df)[1] TRUE
or from the class
class(df)[1] "data.frame"
Note that a data frame is also a list:
is.list(df)[1] TRUE
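Because a data frame is a list of columns, the list tools from the previous sections carry over. A minimal sketch, using a small hypothetical data frame df_demo rather than df:

```r
# Hypothetical data frame, for illustration only
df_demo <- data.frame(numbers = c(1, 2, 3),
                      characters = letters[1:3],
                      stringsAsFactors = FALSE)

# sapply() iterates over the columns, just as it iterates over list components
col_classes <- sapply(df_demo, class)

# List-style extraction with [[ ]] returns the column as a vector
second_col <- df_demo[["characters"]]

col_classes
```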
Checking the structure of df
str(df)'data.frame': 5 obs. of 4 variables:
$ numbers : num 1 2 3 4 5
$ bools : logi TRUE FALSE FALSE TRUE TRUE
$ characters: chr "a" "b" "c" "d" ...
$ dates : Date, format: "2025-03-25" "2025-03-26" ...
you can see that this structure shows similarities with a named list. From the structure, you can also see that this data frame includes 5 observations for 4 variables. The structure also shows the type of each variable. The length() or ncol() show the number of variables, while nrow() shows the number of observations:
length(df)[1] 4
ncol(df)[1] 4
nrow(df)[1] 5
Recall that ncol() and nrow() allowed you to determine the dimensions of a matrix. To access the names of the variables, you can use
names(df)[1] "numbers" "bools" "characters" "dates"
R returns a character vector with the names of the variables. If the data includes row names, you can see them using
row.names(df)[1] "1" "2" "3" "4" "5"
A data frame’s columns must have the same length (nrows) and R will sometimes force this to happen. To see this, let’s change a couple of arguments in df <- data.frame():
df1 <- data.frame(numbers = 10, bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df1 numbers bools characters dates
1 10 TRUE a 2025-03-25
2 10 FALSE b 2025-03-26
3 10 FALSE c 2025-03-27
4 10 TRUE d 2025-03-28
5 10 TRUE e 2025-03-29
R copies the value “10” and fills the column “numbers” until the number of values equals the number of rows in the data frame. This is called recycling. R recycles single numeric values to fill a column.
df2 <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))Error in data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F), characters = letters[1:5], : arguments imply differing number of rows: 5, 3
df2Error: object 'df2' not found
Here, R produces an error. It cannot fill the bools column so that its number of values matches the number of rows in the data frame: 3 values cannot be repeated a whole number of times to fill 5 rows. As R doesn’t know what to do, it does not create this data frame.
df3 <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:8], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))Error in data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, : arguments imply differing number of rows: 5, 8
df3Error: object 'df3' not found
Here too, R will not execute this command. In this case, R doesn’t know which values to drop from the character vector.
However, with one value, R recycles the character:
df4 <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df4 numbers bools characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE a 2025-03-26
3 3 FALSE a 2025-03-27
4 4 TRUE a 2025-03-28
5 5 TRUE a 2025-03-29
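Recycling in data.frame() is not limited to vectors of length one: a shorter vector is also recycled when its length divides the number of rows exactly. The sketch below illustrates this with hypothetical data frames df_ok and df_err; tryCatch() is only used to capture the error message instead of stopping the script:

```r
# 3 divides 6: the logical vector is repeated twice
df_ok <- data.frame(numbers = 1:6, bools = c(TRUE, FALSE, FALSE))

# 3 does not divide 5: data.frame() raises an error, captured here as text
df_err <- tryCatch(
  data.frame(numbers = 1:5, bools = c(TRUE, FALSE, FALSE)),
  error = function(e) conditionMessage(e)
)

df_ok$bools
df_err
```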
To see what the other arguments in the data.frame() function do, let’s add them and see how they change the output.
row.names = 3L uses the third column of the data as row names:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"),
row.names = 3L)
df numbers bools dates
a 1 TRUE 2025-03-25
b 2 FALSE 2025-03-26
c 3 FALSE 2025-03-27
d 4 TRUE 2025-03-28
e 5 TRUE 2025-03-29
row.names = c("Obs.A", "Obs.B", "Obs.C", "Obs.D", "Obs.E") sets the row names from a character vector:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"),
row.names = c("Obs.A", "Obs.B", "Obs.C", "Obs.D", "Obs.E"))
df numbers bools characters dates
Obs.A 1 TRUE a 2025-03-25
Obs.B 2 FALSE b 2025-03-26
Obs.C 3 FALSE c 2025-03-27
Obs.D 4 TRUE d 2025-03-28
Obs.E 5 TRUE e 2025-03-29
Remove the name bools and see what the function returns:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df numbers c.T..F..F..T..T. characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
3 3 FALSE c 2025-03-27
4 4 TRUE d 2025-03-28
5 5 TRUE e 2025-03-29
str(df)'data.frame': 5 obs. of 4 variables:
$ numbers : num 1 2 3 4 5
$ c.T..F..F..T..T.: logi TRUE FALSE FALSE TRUE TRUE
$ characters : chr "a" "b" "c" "d" ...
$ dates : Date, format: "2025-03-25" "2025-03-26" ...
Here, R creates the name of the logical variable from the expression c(T, F, F, T, T). It does so by removing the brackets and replacing commas and spaces with dots. If you include the check.names = FALSE argument, R will use c(T, F, F, T, T) as a name. If you want to avoid this, you need to use fix.empty.names = FALSE.
df <- data.frame(numbers = c(1, 2, 3, 4, 5), c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"),
fix.empty.names = FALSE)
df numbers characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
3 3 FALSE c 2025-03-27
4 4 TRUE d 2025-03-28
5 5 TRUE e 2025-03-29
str(df)'data.frame': 5 obs. of 4 variables:
$ numbers : num 1 2 3 4 5
$ : logi TRUE FALSE FALSE TRUE TRUE
$ characters: chr "a" "b" "c" "d" ...
$ dates : Date, format: "2025-03-25" "2025-03-26" ...
Here, R leaves the name of the variable empty. You can now set your own name. The last argument, stringsAsFactors = FALSE, keeps characters as characters. Changing this into TRUE converts these characters into factors.
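The effect of stringsAsFactors and check.names can be seen side by side in a short sketch. The data frames below are hypothetical and only hold one column each:

```r
# stringsAsFactors: keep characters or convert them into factors
df_chr <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
df_fct <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
c(class(df_chr$x), class(df_fct$x))

# check.names: repair a non-syntactic name or keep it as it is
df_fixed <- data.frame(`var 1` = 1:2, check.names = TRUE)
df_raw <- data.frame(`var 1` = 1:2, check.names = FALSE)
c(names(df_fixed), names(df_raw))
```

With check.names = TRUE, the space in the name is replaced by a dot; with check.names = FALSE, the name is kept and has to be quoted with backticks when you use it.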
Tibbles are essentially data frames but come with a couple of special features. First, to use tibbles, you need to load the tibble package included in the tidyverse suite of packages. Second, there are a couple of differences in how a tibble and a data frame handle, e.g., printing or subsetting. If you print a tibble, it will highlight some special features and will only show the first 10 observations. Data frames show all observations. For long datasets, you need to add a command telling R to show only e.g. 10 lines. Tibbles are also stricter in terms of subsetting compared to data frames. As we’ll see, subsetting a tibble with [ always returns a tibble, while a data frame can return a vector. Last, tibbles allow for non-syntactic column names, e.g. var 1.
Creating a tibble is very similar to creating a data frame.
To illustrate, let’s create a tibble:
df_tib <- tibble::tibble(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df_tib# A tibble: 5 × 4
numbers bools characters dates
<dbl> <lgl> <chr> <date>
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
3 3 FALSE c 2025-03-27
4 4 TRUE d 2025-03-28
5 5 TRUE e 2025-03-29
and compare the result with the data frame:
df <- data.frame(numbers = c(1, 2, 3, 4, 5), bools = c(T, F, F, T, T), characters = letters[1:5], dates = seq.Date(as.Date("2025-03-25"), length.out = 5, by = "day"))
df numbers bools characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
3 3 FALSE c 2025-03-27
4 4 TRUE d 2025-03-28
5 5 TRUE e 2025-03-29
The first thing to note is that the output shows the number of rows and columns for a tibble, but not for a data frame. In addition, the tibble also shows the type of the data stored in each column, while the data frame doesn’t show this information. Row names in the tibble are shown in grey, indicating that they were automatically generated. You can verify the class of a tibble:
class(df_tib)[1] "tbl_df" "tbl" "data.frame"
Here, you can see that a tibble is also a data frame. The tibble() function includes 2 arguments in addition to the data part: .rows = and .name_repair = c("check_unique", "unique", "universal", "minimal"). The former allows you to set the number of rows. You could add this as a check to see if the number of observations in your dataset matches your expectations or to create an empty tibble using .rows = 0. The latter argument allows you to tell R how to treat problematic column names. The default value here is check_unique, which verifies that every column has a unique name but doesn’t try to repair the names; universal makes names unique and brings them in line with the R syntax; unique makes sure that names are present and unique; minimal does not repair or check names other than verifying that names exist.
Using head(df, n = ) or tail(df, n = ) you can print the first (head) or last (tail) n lines of a data frame or tibble. Suppose you want to see the first 2 lines of df you would use:
head(df, n = 2) numbers bools characters dates
1 1 TRUE a 2025-03-25
2 2 FALSE b 2025-03-26
To see the last 3 lines of df_tib:
tail(df_tib, n = 3)# A tibble: 3 × 4
numbers bools characters dates
<dbl> <lgl> <chr> <date>
1 3 FALSE c 2025-03-27
2 4 TRUE d 2025-03-28
3 5 TRUE e 2025-03-29
Using as.data.frame() you can change another object into a data frame. Here, the arguments are largely the same as those for data.frame(), with the exception that you now need to include the object you want to change into a data frame. For instance, let’s create a 3x5 matrix and add names:
mat <- matrix(round(runif(15), 2), 3, 5)
colnames(mat) <- paste("var", 1:5, sep = "_")
rownames(mat) <- paste("obs", 1:3, sep = "_")
mat var_1 var_2 var_3 var_4 var_5
obs_1 0.52 0.71 0.46 0.93 0.25
obs_2 0.34 0.32 0.76 0.12 0.37
obs_3 0.94 0.46 0.39 0.81 0.84
Changing this matrix in a data frame, using as.data.frame():
mat_df <- as.data.frame(mat)
mat_df var_1 var_2 var_3 var_4 var_5
obs_1 0.52 0.71 0.46 0.93 0.25
obs_2 0.34 0.32 0.76 0.12 0.37
obs_3 0.94 0.46 0.39 0.81 0.84
str(mat_df)'data.frame': 3 obs. of 5 variables:
$ var_1: num 0.52 0.34 0.94
$ var_2: num 0.71 0.32 0.46
$ var_3: num 0.46 0.76 0.39
$ var_4: num 0.93 0.12 0.81
$ var_5: num 0.25 0.37 0.84
Note that R used the row and column names of the matrix to add row and column names to the data frame. You can use your own row names if you add them via row.names = c() to the as.data.frame() function. Note that you can also change a data frame (with only numeric variables) into a matrix using as.matrix(). This allows you to use matrix operators (matrix algebra). Often this is much faster than writing code to perform the same calculations on a data frame. Using as.data.frame() you can then change your matrix back into a data frame.
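The round trip from data frame to matrix and back can be sketched as follows. The data frame df_num is hypothetical and the cross-product is just one example of a matrix operation:

```r
# Hypothetical numeric data frame
df_num <- data.frame(var_1 = c(1, 2), var_2 = c(3, 4),
                     row.names = c("obs_1", "obs_2"))

# Data frame -> matrix (row and column names are kept)
m <- as.matrix(df_num)

# Matrix algebra: cross-product t(m) %*% m
cross <- t(m) %*% m

# Matrix -> data frame
cross_df <- as.data.frame(cross)
cross_df
```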
You can also change other objects in a data frame. For instance, here is a list
list1 <- list(
company = c("Firm A", "Firm B", "Firm C", "Firm D", "Firm E"),
sales = runif(5, min = 100000, max = 1000000),
margin = runif(5, min = 0.20, max = 0.36),
region = as.factor(c(1, 1, 2, 2, 2)))
Using as.data.frame():
list1_df <- as.data.frame(list1)
list1_df company sales margin region
1 Firm A 733606.0 0.2465040 1
2 Firm B 554201.6 0.3128367 1
3 Firm C 860078.6 0.2024773 2
4 Firm D 789595.5 0.2356981 2
5 Firm E 608390.8 0.2843190 2
This changes the list into a data frame. What happens with nested lists? To see this, let’s generate a second list:
list2 <- list(
company = c("Firm F", "Firm G", "Firm H", "Firm I", "Firm J"),
sales = runif(5, min = 1000, max = 10000),
margin = runif(5, min = 0.10, max = 0.16),
region = as.factor(c(1, 1, 3, 3, 3)))
and create a nested list list_nest using list1 and list2:
list_nest <- list(list1, list2)
Changing list_nest into a data frame:
list_nest_df <- as.data.frame(list_nest)
list_nest_df company sales margin region company.1 sales.1 margin.1 region.1
1 Firm A 733606.0 0.2465040 1 Firm F 9041.202 0.1170825 1
2 Firm B 554201.6 0.3128367 1 Firm G 3581.459 0.1576101 1
3 Firm C 860078.6 0.2024773 2 Firm H 9732.313 0.1462119 3
4 Firm D 789595.5 0.2356981 2 Firm I 3881.885 0.1210791 3
5 Firm E 608390.8 0.2843190 2 Firm J 2148.158 0.1326172 3
creates a data frame of 8 variables and 5 observations, not a data frame with 4 variables and 10 observations. In other words, here, you’ll need to change the lists on the second level into data frames first e.g. using
list_nest <- lapply(list_nest, function(x) as.data.frame(x))
and then use list_nest to extract the data frames. If all data frames in the nested list include the same variables, you can use rbind() to add them into one data frame. We will discuss rbind() for data frames in more depth in the next section. However, recall that you have used this function to add rows to matrices.
With as_tibble() you need to specify the object that will be changed into a tibble. In addition, you can add the .name_repair = c("check_unique", "unique", "universal", "minimal") argument to repair names. Note that as.tibble() (with a dot) also exists. This function has been replaced by as_tibble(). For instance, to coerce a matrix into a tibble:
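As a minimal sketch of this approach, the example below uses two small hypothetical lists (l1 and l2, not the lists created above) and stacks the resulting data frames with do.call(), which passes all elements of a list as arguments to rbind().

```r
# Two hypothetical lists with the same components
l1 <- list(company = c("Firm A", "Firm B"), sales = c(100, 200))
l2 <- list(company = c("Firm F", "Firm G"), sales = c(10, 20))
nested <- list(l1, l2)

# Step 1: change every second-level list into a data frame
dfs <- lapply(nested, as.data.frame)

# Step 2: stack the data frames on top of each other
stacked <- do.call(rbind, dfs)
dim(stacked)
```

The result has one row per firm and one column per component, which is the 4 variables by 10 observations shape discussed above, scaled down to 2 variables and 4 observations.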
mat_tib <- tibble::as_tibble(mat)
mat_tib# A tibble: 3 × 5
var_1 var_2 var_3 var_4 var_5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.52 0.71 0.46 0.93 0.25
2 0.34 0.32 0.76 0.12 0.37
3 0.94 0.46 0.39 0.81 0.84
Note that as_tibble() doesn’t include the row names. To keep them, you need to add a variable where R can store the row names in the tibble. To do so, you add the argument rownames = "name" in the as_tibble() function. The function will then add the row names from mat as a separate variable to the tibble. The name of this variable is name. For instance, adding the row names of mat to a variable rows in mat_tib:
mat_tib <- tibble::as_tibble(mat, rownames = "rows")
mat_tib# A tibble: 3 × 6
rows var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 obs_1 0.52 0.71 0.46 0.93 0.25
2 obs_2 0.34 0.32 0.76 0.12 0.37
3 obs_3 0.94 0.46 0.39 0.81 0.84
You can also change a data frame into a tibble:
df_tib <- tibble::as_tibble(mat_df)
df_tib# A tibble: 3 × 5
var_1 var_2 var_3 var_4 var_5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.52 0.71 0.46 0.93 0.25
2 0.34 0.32 0.76 0.12 0.37
3 0.94 0.46 0.39 0.81 0.84
If your data frame has row names and you would like to keep them, you need to add rownames = "name" in the as_tibble() function:
df_tib <- tibble::as_tibble(mat_df, rownames = "abcdef")
df_tib# A tibble: 3 × 6
abcdef var_1 var_2 var_3 var_4 var_5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 obs_1 0.52 0.71 0.46 0.93 0.25
2 obs_2 0.34 0.32 0.76 0.12 0.37
3 obs_3 0.94 0.46 0.39 0.81 0.84
Many functions used to import data in R return a data frame. For instance, read.csv() imports tabular data and returns a data frame. The same holds for many packages that allow you to import data. We will cover examples in Chapter 6.
Recall that a data frame borrows characteristics from both a list and a matrix. In other words, you can use both the column number as well as its name to subset a column. Recall that there are 3 subsetting operators: [], [[]] and $. Let’s use each of them with a data frame.
[] with column indices or column names:
df[1] numbers
1 1
2 2
3 3
4 4
5 5
df["numbers"] numbers
1 1
2 2
3 3
4 4
5 5
class(df["numbers"])[1] "data.frame"
As you can see, here too, as with lists, the [] operator is a preserving operator: it preserves the characteristics of the data frame. Note also that the syntax resembles the list syntax: you only include the column you want to extract and you don’t use e.g. [, 2] or [, "numbers"]. In other words, you don’t include an index for the rows. With data frames, R assumes you need all rows of the column. If you apply the operator to a tibble, the tibble structure will be preserved as well. Note the difference with subsetting a column of a matrix. There, we used mat[, i] to extract the ith column. Here, there is no reference to a row. You could use df[, 1] to subset the first column. However, in that case, R will return an unnamed vector if the subsetting is applied to a data frame. In other words, treating a data frame as a matrix causes R to simplify the output if possible. Doing the same with a tibble will not simplify the result: applied to a tibble, [, 1] returns a tibble.
df[, 1][1] 1 2 3 4 5
is.vector(df[, 1])[1] TRUE
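If you prefer the matrix-style syntax but want to keep the data frame structure, base R’s drop argument can switch off the simplification. The sketch below uses a small hypothetical data frame df_demo rather than the df used in the text.

```r
# Hypothetical data frame for illustration
df_demo <- data.frame(numbers = 1:5, characters = letters[1:5])

# Matrix-style subsetting simplifies a single column to a vector
as_vec <- df_demo[, 1]
class(as_vec)

# drop = FALSE keeps the result as a one-column data frame
as_df <- df_demo[, 1, drop = FALSE]
class(as_df)
```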
[[]] with column indices or column names:
df[[1]][1] 1 2 3 4 5
df[["numbers"]][1] 1 2 3 4 5
class(df[["numbers"]])[1] "numeric"
is.vector(df[["numbers"]])[1] TRUE
This operator returns a simplified result. Here, the first column is no longer a data frame but a vector. In other words, [[]] acts, as was the case with lists, as the simplifying operator.
$ with column names:
df$numbers[1] 1 2 3 4 5
class(df$numbers)[1] "numeric"
is.vector(df$numbers)[1] TRUE
As was the case with lists, the $ operator with a data frame returns a simplified result. In other words, df$numbers is equivalent to df[["numbers"]]. The $ operator is the most widely used operator to subset columns in data frames or tibbles. However, there is one difference between data frames and tibbles. Data frames allow for partial matching while tibbles don’t. For instance, with a data frame:
df$numb[1] 1 2 3 4 5
will work even if there is no variable numb. Doing so with a tibble doesn’t work:
df_tbl <- tibble::as_tibble(df)
df_tbl$numbWarning: Unknown or uninitialised column: `numb`.
NULL
As you can see, R didn’t extract the values and gave a warning message.
With respect to multiple columns or negative index positions, data frames and tibbles are comparable to lists, vectors or matrices: a negative index position extracts all but the column with the negative index:
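Partial matching can hide typos in variable names. For data frames, the [[]] operator matches names exactly by default, which makes it a stricter alternative to $. A minimal sketch with a hypothetical data frame df_demo:

```r
# Hypothetical data frame for illustration
df_demo <- data.frame(numbers = 1:5)

# $ uses partial matching on a data frame
partial <- df_demo$numb

# [[ ]] requires an exact match by default and returns NULL otherwise
strict <- df_demo[["numb"]]

# exact = FALSE switches partial matching on for [[ ]]
relaxed <- df_demo[["numb", exact = FALSE]]
```

Here partial and relaxed hold the values of numbers, while strict is NULL.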
df[-4] numbers bools characters
1 1 TRUE a
2 2 FALSE b
3 3 FALSE c
4 4 TRUE d
5 5 TRUE e
and selecting two or more columns is similar to lists or matrices, e.g.:
df[c(1, 4)] numbers dates
1 1 2025-03-25
2 2 2025-03-26
3 3 2025-03-27
4 4 2025-03-28
5 5 2025-03-29
You can extract a column using the pipe operator. Using base R’s pipe:
df |> _$numbers[1] 1 2 3 4 5
returns df$numbers. This also holds for tibbles. Note that if you use the {magrittr} pipe, you need to change the _ to a dot (.).
There are three ways to subset an individual value. They all return the same output:
df[2, 3][1] "b"
df[[2, 3]][1] "b"
df$characters[2][1] "b"
Negative indices extract all but that value, e.g.
df[-2, 3][1] "a" "c" "d" "e"
extracts all but the second row of the third column of df.
Here, there is no difference between a tibble and a data frame.
Data frames show a lot of similarities with other data structures in terms of how you can use logical vectors to subset columns or rows. For instance, extracting the dates on the condition that the value in the column numbers is larger than 2:
cond <- df$numbers > 2
df$dates[cond][1] "2025-03-27" "2025-03-28" "2025-03-29"
or selecting multiple columns conditional upon numbers being larger than 2:
df[cond, 1:3] numbers bools characters
3 3 FALSE c
4 4 TRUE d
5 5 TRUE e
As you could with the other data structures you can also extract columns using e.g. grepl(). Extracting variables whose name includes “numbers” or “dates” for instance, can be done using:
df[grepl(pattern = "numbers|dates", colnames(df))] numbers dates
1 1 2025-03-25
2 2 2025-03-26
3 3 2025-03-27
4 4 2025-03-28
5 5 2025-03-29
As an alternative, the subset(x, subset, select, drop = FALSE, ...) function allows you to select the variables of a data frame listed in select for the rows that meet the condition in subset. For instance, selecting the columns “numbers”, “bools” and “characters” for the rows where “numbers” is larger than 2:
subset(df, df$numbers > 2, c("numbers", "bools", "characters")) numbers bools characters
3 3 FALSE c
4 4 TRUE d
5 5 TRUE e
Recall that you extracted these values also using df[df$numbers > 2, 1:3].
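Inside subset(), both the condition and the selection are evaluated within the data frame, so you can also refer to the columns by their unquoted names. The sketch below uses a hypothetical data frame df_demo with the same column names as df.

```r
# Hypothetical data frame mirroring the first three columns of df
df_demo <- data.frame(
  numbers = 1:5,
  bools = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  characters = c("a", "b", "c", "d", "e")
)

# Unquoted column names in both the condition and the selection
res <- subset(df_demo, numbers > 2, select = c(numbers, characters))

# A range of adjacent columns can be selected with the : operator
res_range <- subset(df_demo, numbers > 2, select = numbers:bools)
```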
In subsequent chapters, we’ll use {dplyr}’s filter() and select() functions to select observations (filter()) and variables (select()).
Changing individual elements of a data frame is straightforward: you reassign their value as you did for vectors or matrices.
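A short sketch of such reassignments, using a hypothetical data frame df_demo: one value is changed by its row and column position, and several values are changed at once with a logical condition.

```r
# Hypothetical data frame for illustration
df_demo <- data.frame(A = c(11, 21, 31), B = c(12, 22, 32))

# Change one value: second row of column B
df_demo[2, "B"] <- 99

# Change all values of A that meet a condition
df_demo$A[df_demo$A > 20] <- 0

df_demo
```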
With respect to data frames, you can use cbind() and rbind() to add columns and rows to a data frame. These columns can be stored in vectors, matrices or data frames. Recall that we also used these functions for matrices. Suppose that you have a data frame df1 and vectors D and E. As you can see, df1 has 4 rows and 3 variables. As you may recall from the section on matrices, this means the columns you want to add need 4 rows and the rows you want to add need 3 columns.
df1 <- data.frame(A = c(11, 21, 31, 41), B = c(12, 22, 32, 42), C = c(13, 23, 33, 43))
D <- c(14, 24, 34, 44)
E <- c(51, 52, 53, 54)
Let’s now use cbind() to add the vector D to df1:
cbind(df1, D) A B C D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
Here, R used the name of the vector as a variable name in the data frame df1. What if the vector is not named? To see what happens, let’s use
cbind(df1, c(10, 11, 12, 13)) A B C c(10, 11, 12, 13)
1 11 12 13 10
2 21 22 23 11
3 31 32 33 12
4 41 42 43 13
As you can see, R uses the expression that created the vector as the variable name. In other words, if the vector or matrix isn’t named, you need to add names before using cbind() or set names afterwards. Recall that you can create a component in a list using list$component <- .... As data frames are lists, you can use the same approach to add a new variable to a dataset. For instance, to add the vector D to df1 you can also use
df1$D <- c(14, 24, 34, 44)
df1 A B C D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
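Returning to unnamed vectors in cbind(): both ways of naming the new column can be sketched as follows. The data frame df_demo and the column name extra are hypothetical.

```r
# Hypothetical data frame for illustration
df_demo <- data.frame(A = c(11, 21, 31, 41), B = c(12, 22, 32, 42))

# Option 1: name the vector inside the call to cbind()
named <- cbind(df_demo, extra = c(10, 11, 12, 13))

# Option 2: bind first and set the name afterwards
unnamed <- cbind(df_demo, c(10, 11, 12, 13))
names(unnamed)[3] <- "extra"

names(named)
```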
Adding rows uses rbind(). Adding rows to a data frame is only relevant when the rows you add include observations for the same variables. Suppose that the vector E includes observations for the variables A, B, C and D. Using rbind() you can add them to the data frame:
rbind(df1, E) A B C D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
5 51 52 53 54
To add a data frame df2
df2 <- data.frame(G = c(18, 28, 38, 48), H = c(19, 29, 39, 49))
to df1, you can use the same functions. For instance, adding the columns of df2 to those of df1:
cbind(df1, df2) A B C D G H
1 11 12 13 14 18 19
2 21 22 23 24 28 29
3 31 32 33 34 38 39
4 41 42 43 44 48 49
and adding the rows of df3
df3 <- data.frame(A = c(51, 61), B = c(52, 62), C = c(53, 63), D = c(54, 64))
to those of df1 using rbind():
rbind(df1, df3) A B C D
1 11 12 13 14
2 21 22 23 24
3 31 32 33 34
4 41 42 43 44
5 51 52 53 54
6 61 62 63 64
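When both arguments are data frames, rbind() matches the columns by name rather than by position. The sketch below illustrates this with two hypothetical data frames a and b whose columns appear in a different order.

```r
# Two hypothetical data frames with the same variables in a different order
a <- data.frame(A = c(1, 2), B = c(3, 4))
b <- data.frame(B = 5, A = 6)

# The value 6 ends up in column A and the value 5 in column B
stacked <- rbind(a, b)
stacked
```

A plain vector has no column names, so its values are added by position, as was the case for the vector E.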
Often you want to create a new variable where you use other values in your dataset. There are a couple of ways to do so. First you can create a new variable and add the calculation on the right hand side of the assignment operator. As an example, suppose that you want to add the log of A to df1. To do so, you can use
df1$logA <- log(df1$A)
df1 A B C D logA
1 11 12 13 14 2.397895
2 21 22 23 24 3.044522
3 31 32 33 34 3.433987
4 41 42 43 44 3.713572
Using with(data, expression, ...) you can avoid the references to the data frame in the calculation. The first argument of the function is the data frame where R will look for the variables used in expression. In other words, with(df1, ...) allows you to eliminate df1$ in your calculation. If you use A in that expression, R knows that this A is a variable included in df1. To add a column to df1 calculated as the ratio df1$A/df1$B you would use:
df1$ratioAB <- with(df1, A/B)
df1 A B C D logA ratioAB
1 11 12 13 14 2.397895 0.9166667
2 21 22 23 24 3.044522 0.9545455
3 31 32 33 34 3.433987 0.9687500
4 41 42 43 44 3.713572 0.9761905
Without this function, you would have to write
df1$ratioABalt <- df1$A/df1$B
df1 A B C D logA ratioAB ratioABalt
1 11 12 13 14 2.397895 0.9166667 0.9166667
2 21 22 23 24 3.044522 0.9545455 0.9545455
3 31 32 33 34 3.433987 0.9687500 0.9687500
4 41 42 43 44 3.713572 0.9761905 0.9761905
Using with() you have to assign the result of a calculation to the data frame using df$newvar. Using the within() function, you can avoid this. This function has the same arguments as the with() function, but you add the name of the new variable in the expression part. The within() function returns a new data frame which is a copy of the old data frame plus the columns you added in the expression. In other words, the within() function preserves the “old” data frame and you have to assign the result of within() to a new data frame if you want to access these new values. If you are sure you won’t need the old data frame, you can assign the result of within() to that old data frame. Using within() also allows you to add multiple expressions. As an example, suppose you want to add the sum of A and B as well as the difference between D and C to the data frame (note the {} and the fact that every new variable has a new line without a comma at the end of the line):
dfnew1 <- within(df1, {
sumAB <- A + B
diffDC <- D - C
})
dfnew1 A B C D logA ratioAB ratioABalt diffDC sumAB
1 11 12 13 14 2.397895 0.9166667 0.9166667 1 23
2 21 22 23 24 3.044522 0.9545455 0.9545455 1 43
3 31 32 33 34 3.433987 0.9687500 0.9687500 1 63
4 41 42 43 44 3.713572 0.9761905 0.9761905 1 83
If you assign the results to an existing variable, within() overwrites this variable:
dfnew2 <- within(df1, {
A <- A / 10
B <- B * 10
C <- C / D
})
dfnew2 A B C D logA ratioAB ratioABalt
1 1.1 120 0.9285714 14 2.397895 0.9166667 0.9166667
2 2.1 220 0.9583333 24 3.044522 0.9545455 0.9545455
3 3.1 320 0.9705882 34 3.433987 0.9687500 0.9687500
4 4.1 420 0.9772727 44 3.713572 0.9761905 0.9761905
Note that you need to be careful when you design the sequence of expressions. For instance, if you first change A, and then use the value of A in your expression for B, R will use the new values for A as it doesn’t recall what the values of A were before you changed them.
To delete rows and columns, you can use the familiar approaches. For instance, you can use a negative index to remove the third column (“C”):
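The effect of the order of the expressions can be sketched with a small hypothetical data frame d. The two calls contain the same expressions in a different order and return different values for B.

```r
# Hypothetical data frame for illustration
d <- data.frame(A = c(10, 20), B = c(1, 2))

# A is changed first, so B is calculated with the new values of A
r1 <- within(d, {
  A <- A / 10
  B <- B + A
})

# B is calculated first, so it uses the original values of A
r2 <- within(d, {
  B <- B + A
  A <- A / 10
})

r1$B
r2$B
```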
df4 <- df1[-3]
df4 A B D logA ratioAB ratioABalt
1 11 12 14 2.397895 0.9166667 0.9166667
2 21 22 24 3.044522 0.9545455 0.9545455
3 31 32 34 3.433987 0.9687500 0.9687500
4 41 42 44 3.713572 0.9761905 0.9761905
Assign NULL to remove column “B”:
df1$B <- NULL
df1 A C D logA ratioAB ratioABalt
1 11 13 14 2.397895 0.9166667 0.9166667
2 21 23 24 3.044522 0.9545455 0.9545455
3 31 33 34 3.433987 0.9687500 0.9687500
4 41 43 44 3.713572 0.9761905 0.9761905
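Use a logical condition on the rows to keep only the observations you need, here the row where A equals 31:
df1[df1$A == 31, ] A C D logA ratioAB ratioABalt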
3 31 33 34 3.433987 0.96875 0.96875
Using the within() function, you can use <- NULL to delete multiple columns from your data frame:
dfnew1 <- within(dfnew1, {
A <- NULL
ratioAB <- NULL
ratioABalt <- NULL
sumAB <- NULL
})
dfnew1 B C D logA diffDC
1 12 13 14 2.397895 1
2 22 23 24 3.044522 1
3 32 33 34 3.433987 1
4 42 43 44 3.713572 1
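If the columns you want to delete are stored in a character vector, you can also remove them by name with the [] operator. This is a sketch of two equivalent base R idioms; the data frame df_demo and the vector drop_cols are hypothetical.

```r
# Hypothetical data frame for illustration
df_demo <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
drop_cols <- c("A", "C")

# Keep all names that are not in drop_cols
kept_a <- df_demo[setdiff(names(df_demo), drop_cols)]

# Same result using a logical vector
kept_b <- df_demo[!names(df_demo) %in% drop_cols]

names(kept_a)
```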
There is little difference between the approach you use to apply functions to a data frame and the one for vectors, matrices or lists. This shouldn’t come as a surprise, as a data frame is a list with characteristics of a matrix and R functions are vectorized. A couple of examples to illustrate some functions:
summary(df1) A C D logA ratioAB
Min. :11.0 Min. :13.0 Min. :14.0 Min. :2.398 Min. :0.9167
1st Qu.:18.5 1st Qu.:20.5 1st Qu.:21.5 1st Qu.:2.883 1st Qu.:0.9451
Median :26.0 Median :28.0 Median :29.0 Median :3.239 Median :0.9616
Mean :26.0 Mean :28.0 Mean :29.0 Mean :3.147 Mean :0.9540
3rd Qu.:33.5 3rd Qu.:35.5 3rd Qu.:36.5 3rd Qu.:3.504 3rd Qu.:0.9706
Max. :41.0 Max. :43.0 Max. :44.0 Max. :3.714 Max. :0.9762
ratioABalt
Min. :0.9167
1st Qu.:0.9451
Median :0.9616
Mean :0.9540
3rd Qu.:0.9706
Max. :0.9762
colMeans(df1) A C D logA ratioAB ratioABalt
26.0000000 28.0000000 29.0000000 3.1474942 0.9540381 0.9540381
rowMeans(df1)[1] 7.038538 12.158936 17.228581 22.277659
colSums(df1) A C D logA ratioAB ratioABalt
104.000000 112.000000 116.000000 12.589977 3.816153 3.816153
rowSums(df1)[1] 42.23123 72.95361 103.37149 133.66595
apply() function: mean per column:
apply(df1, 2, mean) A C D logA ratioAB ratioABalt
26.0000000 28.0000000 29.0000000 3.1474942 0.9540381 0.9540381
lapply() function: standard deviation per column:
lapply(df1, \(x) sd(x))$A
[1] 12.90994
$C
[1] 12.90994
$D
[1] 12.90994
$logA
[1] 0.5700948
$ratioAB
[1] 0.02648301
$ratioABalt
[1] 0.02648301
sapply() function: maximum per column:
sapply(df1, \(x) max(x)) A C D logA ratioAB ratioABalt
41.0000000 43.0000000 44.0000000 3.7135721 0.9761905 0.9761905
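The examples above work because all columns of df1 are numeric. apply() first changes the data frame into a matrix, so a single character column turns all values into characters and mean() then returns NA with a warning. One way around this, sketched here with a hypothetical data frame d, is to select the numeric columns first.

```r
# Hypothetical data frame with a non-numeric column
d <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), label = c("a", "b", "c"))

# Logical vector that flags the numeric columns
num_cols <- sapply(d, is.numeric)

# Column means for the numeric columns only
means_a <- colMeans(d[num_cols])

# Same calculation with vapply(), which checks the type of each result
means_b <- vapply(d[num_cols], mean, numeric(1))

means_a
```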
Create a 20x3 matrix mat with rownames obs_1 … and variable names var_1 … whose values are drawn from a uniform distribution with minimum 50 and maximum 100:
rn <- paste("obs", 1:20, sep = "_")
cn <- paste("var", 1:3, sep = "_")
mat <- matrix(runif(60, 50, 100), 20, 3, dimnames = list(rn, cn))
Create a data frame mat_df and a tibble mat_tb. Note that for the tibble, you need to include tibble::
mat_df <- as.data.frame(mat)
mat_tb <- tibble::as_tibble(mat)
Extract the column var_1 from both and store in col_df and col_tb using the $ operator:
col_df <- mat_df$var_1
col_tb <- mat_tb$var_1
Check the type of both columns you extracted:
typeof(col_df)[1] "double"
typeof(col_tb)[1] "double"
Let’s now use a real dataset, mtcars, which is part of your R installation. Assign this dataset to a data frame df:
df <- mtcars
Use df to create a tibble tb of mtcars:
tb <- tibble::as_tibble(df)
Print both datasets by running only their name:
df mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
tb# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
What is the difference in result between a data frame and a tibble?
Tell R to keep the row names from df when it creates the tibble tb and store the results in models:
tb <- tibble::as_tibble(df, rownames = "models")
tb# A tibble: 32 × 12
models mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
Extract the column hp from the df data frame and assign this column to a variable hp
hp <- df$hp
Extract the column disp from the tibble tb using the [] operator. Assign this column to a variable disp:
disp <- tb["disp"]
If you ask R to print this variable (do this in the console), what do you expect will happen: R prints all lines or R prints the first 10 lines?
Extract from df the observations for cars that include a digit at the end of their name (e.g. Duster 360, Mazda RX4):
pat <- "\\d+$"
df[grepl(pattern = pat, x = row.names(df)), ] mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Do the same, but now, use the tibble tb
pat <- "\\d+$"
tb[grepl(pattern = pat, x = tb$models), ]# A tibble: 9 × 12
models mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
4 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
5 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
7 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
8 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 Porsche 914… 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
Extract all observations from df whose am == 1:
df[df$am == 1, ] mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Add a new variable to the tibble, tb$mpg_cyl, calculated as the ratio of the variable mpg and cyl:
tb$mpg_cyl <- with(tb, mpg/cyl)
tb# A tibble: 32 × 13
models mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: mpg_cyl <dbl>
Use the within() function to add 3 columns to df: mpg/cyl, mpg/hp and mpg/disp. Store these in mpg_cyl, mpg_hp and mpg_disp. Overwrite df and show the first 5 lines of this new data frame using head(x, n = 5):
df <- within(df, {
mpg_cyl <- mpg/cyl
mpg_hp <- mpg/hp
mpg_disp <- mpg/disp
})
head(df, n = 5) mpg cyl disp hp drat wt qsec vs am gear carb mpg_disp
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.13125000
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.13125000
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.21111111
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.08294574
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.05194444
mpg_hp mpg_cyl
Mazda RX4 0.1909091 3.500000
Mazda RX4 Wag 0.1909091 3.500000
Datsun 710 0.2451613 5.700000
Hornet 4 Drive 0.1945455 3.566667
Hornet Sportabout 0.1068571 2.337500
Use apply() to calculate the mean per variable in df:
apply(df, 2, mean, na.rm = TRUE) mpg cyl disp hp drat wt
20.0906250 6.1875000 230.7218750 146.6875000 3.5965625 3.2172500
qsec vs am gear carb mpg_disp
17.8487500 0.4375000 0.4062500 3.6875000 2.8125000 0.1398688
mpg_hp mpg_cyl
0.1905456 3.8369792
Do the same, but now for the tibble tb:
cond <- sapply(tb, \(x) is.numeric(x))
apply(tb[cond], 2, mean) mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb mpg_cyl
0.437500 0.406250 3.687500 2.812500 3.836979
Ask for a summary table of df
summary(df) mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb mpg_disp
Min. :0.0000 Min. :3.000 Min. :1.000 Min. :0.02203
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:0.04956
Median :0.0000 Median :4.000 Median :2.000 Median :0.09458
Mean :0.4062 Mean :3.688 Mean :2.812 Mean :0.13987
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:0.17740
Max. :1.0000 Max. :5.000 Max. :8.000 Max. :0.47679
mpg_hp mpg_cyl
Min. :0.04478 Min. :1.300
1st Qu.:0.08944 1st Qu.:1.928
Median :0.15041 Median :3.108
Mean :0.19055 Mean :3.837
3rd Qu.:0.24129 3rd Qu.:5.700
Max. :0.58462 Max. :8.475
Predict the outcome if you would run summary(tb). Create the same table as the result of df but now for tb:
summary(tb[cond]) mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb mpg_cyl
Min. :0.0000 Min. :3.000 Min. :1.000 Min. :1.300
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.928
Median :0.0000 Median :4.000 Median :2.000 Median :3.108
Mean :0.4062 Mean :3.688 Mean :2.812 Mean :3.837
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.700
Max. :1.0000 Max. :5.000 Max. :8.000 Max. :8.475
Time series are special as their observations are observed, measured or recorded at a specific moment in time (a date or a date/time). In economics and management, a lot of data come in the form of a time series: sales are measured per month, quarter or year, accounting data refer to a specific year, semester or quarter, stock prices are recorded by day, hour or minute, and inflation or unemployment are usually reported on a monthly basis. This property has a couple of consequences. First, these observations are ordered. The date or time allows you to say which observation comes first, which second and which comes last. Extracting observations from a time series needs to preserve this property. Second, most time frames can be aggregated. For instance, a week is an aggregation of days, a year an aggregation of quarters, months, weeks or days, and an hour is an aggregation of minutes. In other words, you can start from a monthly time series and generate a yearly series. How you do so depends on the series. For instance, you can add 4 quarters of sales to calculate yearly sales. However, this is not the case for, e.g., stock market prices, where the sum of prices across time doesn’t make sense. Here, you would need another measure, e.g. the price at the end of the last hour of trading as your price for the day, or the last price at the end of the month for a monthly series of stock market prices. Third, time can be regular or irregular. If time is regular, then you measure something at evenly spaced moments in time: every month, every year or every minute. If a time series is irregular, this is not the case. For instance, if you measure the noise generated by departing airplanes in areas close to the airport, you’ll have a measurement each time an airplane takes off. Here, your time series will show irregular intervals.
In addition to pure time series, a lot of datasets include both cross sections (e.g. firms) as well as time series (e.g. sales per year). This is called a panel dataset: for every firm, country, household, … in your dataset, you observe variables at multiple times, e.g. one observation for every year for the last 10 years. If you have a dataset that includes sales data for 50 products, your panel dataset includes 500 observations: for every product, you have 10 observations, one per year for each of the 10 years in your dataset.
To handle time series, R includes the ts class, created with ts(). This class uses regular time intervals. In addition, there are many packages that extend the ability of R to use time series, e.g. {zoo} or {xts}. These packages also allow irregular time intervals. The time series equivalent of a tibble is called a tsibble and is used in the {tsibble} package (Wang, Cook, and Hyndman (2020)). This package allows you to change or mutate time series data. Packages such as {quantmod}, {tidyfinance}, {forecast} or {econometrics} use these formats to e.g. develop quantitative trading strategies ({quantmod}), analyse financial data ({tidyfinance}), develop forecasts ({forecast}) or estimate regressions including methods for time series ({econometrics}).
In this section we will use base R’s ts() as well as the {xts} (eXtensible time series) package. The latter automatically installs {zoo}. To install {xts} you run
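The structure of such a panel can be sketched with base R’s expand.grid(), which creates one row for every combination of its arguments. The product names and the years 2015 to 2024 are hypothetical.

```r
# One row per combination of year and product: 10 years x 50 products
panel <- expand.grid(
  year = 2015:2024,
  product = paste0("product_", 1:50)
)

# Total number of observations in the panel
nrow(panel)

# Number of observations for the first few products
head(table(panel$product))
```

In a real panel dataset, each of these rows would also hold the variables you observe, e.g. the sales of that product in that year.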
if (!require("xts")) install.packages("xts")
Loading required package: xts
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
ts()To create a time series, you need to include both the data as well as the date/time values. With respect to the first, let’s create a vector with 25 values drawn as a sequence starting at 10 in steps of 10:
data <- seq(10, by = 10, length.out = 25)
Note that data could also be a matrix or a data frame. We now want to create a time series. To do so, we need to add the date/time dimension. Using base R’s ts() function, you can add a start, an end and a frequency. The start is given as a single value or as a vector. For instance, start = 2001 starts the series in 2001, and start = c(2001, 1) starts the series in the first period of 2001 (e.g. 2001-01 for monthly data). The frequency is the number of observations per unit of time: 1 refers to yearly data, 4 to quarterly data and 12 to monthly data. Specifying the start and frequency allows R to determine the end date from the length of the series. Let’s create a yearly time series for the values in data starting in 2000. To do so, we use:
ts_data_year <- ts(data, start = 2000, frequency = 1)
If you print the series,
ts_data_year
Time Series:
Start = 2000
End = 2024
Frequency = 1
[1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190
[20] 200 210 220 230 240 250
you see that R created a time series with start in 2000, end in 2024 with frequency equal to 1, i.e. yearly.
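Instead of reading these properties from the printed output, you can also query them directly. The sketch below rebuilds the same yearly series under the assumed name x and uses the base R helpers start(), end(), frequency() and time().

```r
# Same values as above: 25 observations, yearly, starting in 2000.
x <- ts(seq(10, by = 10, length.out = 25), start = 2000, frequency = 1)

start(x)       # first period of the series
end(x)         # last period, derived from start, frequency and length
frequency(x)   # number of observations per year
head(time(x))  # the time stamp attached to each observation
```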
To create quarterly data, you can use
ts_data_quar <- ts(data, start = c(2015, 1), frequency = 4)
ts_data_quar
     Qtr1 Qtr2 Qtr3 Qtr4
2015   10   20   30   40
2016   50   60   70   80
2017   90  100  110  120
2018  130  140  150  160
2019  170  180  190  200
2020  210  220  230  240
2021  250
R adds the reference to quarters and determines the final quarter from the length of the data. To create a monthly series starting in June, you change the frequency to 12 and change the start month:
ts_data_mont <- ts(data, start = c(2023, 6), frequency = 12)
ts_data_mont
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2023                      10  20  30  40  50  60  70
2024  80  90 100 110 120 130 140 150 160 170 180 190
2025 200 210 220 230 240 250
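To extract part of a ts object while keeping its time information, base R offers window(). The sketch below rebuilds the monthly series under the assumed name x and keeps only the observations for 2024.

```r
# Monthly series starting in June 2023, as in the example above.
x <- ts(seq(10, by = 10, length.out = 25), start = c(2023, 6), frequency = 12)

# Keep January 2024 up to and including December 2024.
x_2024 <- window(x, start = c(2024, 1), end = c(2024, 12))
x_2024
```

The result is again a ts object, so the order and the dates of the observations are preserved.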
You can verify that these series are time series using class(). For instance, to check if ts_data_mont is a time series, you use
class(ts_data_mont)
[1] "ts"
You can extend this example to e.g. matrices. In that case, data will be a matrix.
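A minimal sketch of the matrix case, with made-up values and the assumed column names sales and costs. It also shows aggregate(), which converts a ts object to a lower frequency by applying a function (here sum) to each block of observations.

```r
# A 12 x 2 matrix with hypothetical monthly sales and costs.
m <- matrix(1:24, nrow = 12, ncol = 2,
            dimnames = list(NULL, c("sales", "costs")))

# Each column becomes a monthly series sharing the same dates.
ts_mat <- ts(m, start = c(2024, 1), frequency = 12)
class(ts_mat)

# Aggregate the 12 months into one yearly total per column.
yearly <- aggregate(ts_mat, nfrequency = 1, FUN = sum)
yearly
```

Summing is only sensible for variables such as sales or costs; for prices you would pass another function to FUN.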
First, let’s load the package:
library(xts)
As you can see, this package loads another package, {zoo}. This is because {xts} relies on some of the functions in the {zoo} package.
Let’s now create a date/time variable using seq.POSIXt() with length 25 (consistent with the length of data) and in intervals of one month:
datetime <- seq.POSIXt(from = as.POSIXct("2022-03-25"), length.out = 25, by = "months")
Creating an {xts} object now uses as.xts(). The first argument of this function is the dataset, in this case data. The second argument is the date/time variable:
data_xts <- as.xts(data, datetime)
Inspecting this time series shows that the dates/times are added to data as row names in the usual ISO format: %Y-%m-%d.
data_xts
           [,1]
2022-03-25 10
2022-04-25 20
2022-05-25 30
2022-06-25 40
2022-07-25 50
2022-08-25 60
2022-09-25 70
2022-10-25 80
2022-11-25 90
2022-12-25 100
2023-01-25 110
2023-02-25 120
2023-03-25 130
2023-04-25 140
2023-05-25 150
2023-06-25 160
2023-07-25 170
2023-08-25 180
2023-09-25 190
2023-10-25 200
2023-11-25 210
2023-12-25 220
2024-01-25 230
2024-02-25 240
2024-03-25 250
The class of the time series data_xts is “xts”, “zoo”. The latter is included because the former builds on the latter.
class(data_xts)
[1] "xts" "zoo"
Note that the data in an {xts} object are essentially a matrix. In other words, an {xts} object cannot store more than one variable type. For most applications, this is not too much of an issue. However, if your data include a mix of types, you’ll need to store the numeric variables in a separate dataset.
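Why mixed types are a problem can be illustrated with an ordinary base R matrix. The vectors num and chr are made up for this sketch; the point is only that a matrix coerces all columns to one common type.

```r
num <- c(1.5, 2.5)
chr <- c("a", "b")

# Binding a numeric and a character vector into one matrix
# silently converts the numbers into character strings.
m <- cbind(num, chr)
typeof(m)
m
```

After the coercion, the numeric column can no longer be used in calculations without converting it back.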
You can coerce other data structures into a time series object. To illustrate this, let’s first create two other objects: a matrix and a data frame. Let’s first create a 50x4 matrix with values and a vector with 50 monthly dates, and bind them together.
mat1 <- matrix(runif(200, min = 50, max = 100), 50, 4)
colnames(mat1) <- paste("var", 1:4, sep = "_")
rownames(mat1) <- paste("obs", 1:50, sep = "_")
mat_dates <- seq.POSIXt(from = as.POSIXct("2020-03-25"),
length.out = 50,
by = "months")
mat <- cbind(mat_dates, mat1)
Recall that a matrix is a homogeneous structure. In other words, the dates will be converted into numeric format (the number of seconds since 1970-01-01).
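The conversion can be made visible with a small base R sketch. The objects d, m and back are made up for this illustration, and the time zone is fixed to UTC so that the result does not depend on your local settings.

```r
# Three monthly date/times in UTC.
d <- seq(as.POSIXct("2020-03-25", tz = "UTC"), by = "month", length.out = 3)

# cbind() drops the date/time class: the dates become plain numbers.
m <- cbind(d, value = c(1, 2, 3))
is.numeric(m)
m[, "d"]

# The numbers are seconds since 1970-01-01, so they can be converted back.
back <- as.POSIXct(m[, "d"], origin = "1970-01-01", tz = "UTC")
back
```

Supplying origin explicitly is a cautious choice here: older versions of R require it when converting numbers to POSIXct.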
Using this matrix, we can create a data frame. Here, we can add various types of data.
mat1_df <- as.data.frame(mat1, row.names = rownames(mat1))
mat_df <- cbind(mat_dates, mat1_df)
Note that in this case, the column with dates is kept as a date/time variable. Let’s now use the {xts} package to coerce both into a time series format. The as.xts() function has multiple arguments: as.xts(x, order.by, dateFormat = "POSIXct", ...). The first, x, is the matrix or data frame. The second, order.by, should be a vector of dates or date/times that allows R to order the rows in x. The dateFormat argument allows you to change the format from the default POSIXct to e.g. Date. Let’s use this function to change the matrix into a time series:
mat_ts <- as.xts(mat, order.by = as.POSIXct(mat[, 1], format = "%Y-%m-%d"))
head(mat_ts, 5)
mat_dates var_1 var_2 var_3 var_4
2020-03-25 1585090800 61.81532 71.34831 84.25773 87.37033
2020-04-25 1587765600 83.67822 91.79865 79.53507 90.04232
2020-05-25 1590357600 73.80289 58.45403 62.74126 71.43401
2020-06-25 1593036000 62.37750 72.38072 71.67449 92.25952
2020-07-25 1595628000 64.98585 65.08229 91.09376 85.55991
The function returns a time series in which it used the dates from mat_dates, stored in the first column of mat, to add date/time values to the matrix. In doing so, it kept mat_dates as a separate numeric variable in the dataset.
The data frame includes the date/time variable as a POSIXct type. In other words, the data frame already stores the date as a separate date/time variable. As a result, you don’t need to coerce that variable into a date in the as.xts() function. It is sufficient to include it in the order.by argument:
mat_dfts <- as.xts(mat_df, order.by = mat_df$mat_dates)
head(mat_dfts, 5)
mat_dates var_1 var_2 var_3 var_4
2020-03-25 2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 2020-07-25 64.98585 65.08229 91.09376 85.55991
Note that here too, R kept the mat_dates variable in the time series dataset. However, here you are including various data types in an xts object. Recall that these objects are essentially matrices. R will change the type of these variables. To avoid that, you need to exclude this mat_dates variable from the coercion:
mat_dfts <- as.xts(mat_df[, 2:5], order.by = mat_df$mat_dates)
head(mat_dfts, 5)
var_1 var_2 var_3 var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
Let’s now use mat_dfts to extract specific variables. Most subsetting approaches that we covered for other data structures can be used for xts time series as well. Note that if you use the preserving subsetting operator [], the result will always show the relevant date/time, as R preserves the structure of the dataset. For example:
head(mat_dfts[, 2:3], n = 5)
var_2 var_3
2020-03-25 71.34831 84.25773
2020-04-25 91.79865 79.53507
2020-05-25 58.45403 62.74126
2020-06-25 72.38072 71.67449
2020-07-25 65.08229 91.09376
head(mat_dfts[, -1], n = 5)
var_2 var_3 var_4
2020-03-25 71.34831 84.25773 87.37033
2020-04-25 91.79865 79.53507 90.04232
2020-05-25 58.45403 62.74126 71.43401
2020-06-25 72.38072 71.67449 92.25952
2020-07-25 65.08229 91.09376 85.55991
mat_dfts[4, ]
var_1 var_2 var_3 var_4
2020-06-25 62.3775 72.38072 71.67449 92.25952
Using the $ operator, you can extract variables, e.g.
head(mat_dfts$var_1, n = 10)
var_1
2020-03-25 61.81532
2020-04-25 83.67822
2020-05-25 73.80289
2020-06-25 62.37750
2020-07-25 64.98585
2020-08-25 52.63511
2020-09-25 66.57814
2020-10-25 66.31016
2020-11-25 53.81508
2020-12-25 50.25629
In addition, and specifically for time series, you can use the date/times to extract specific components. For instance:
mat_dfts["2020-07-25"]
var_1 var_2 var_3 var_4
2020-07-25 64.98585 65.08229 91.09376 85.55991
mat_dfts["2020-03-25/2020-07-25"]
var_1 var_2 var_3 var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
mat_dfts["/2020-07-25"]
var_1 var_2 var_3 var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
2020-07-25 64.98585 65.08229 91.09376 85.55991
mat_dfts["2023-12-25/"]
var_1 var_2 var_3 var_4
2023-12-25 76.81282 78.46668 88.45526 82.58745
2024-01-25 91.99313 69.49864 60.66829 86.80344
2024-02-25 80.60366 63.13936 74.42565 97.02449
2024-03-25 81.18256 90.97987 52.96503 61.63764
2024-04-25 76.47097 99.46359 91.42821 96.14635
mat_dfts["2022"]
var_1 var_2 var_3 var_4
2022-01-25 83.15106 77.58532 88.85743 91.75516
2022-02-25 50.57743 93.83828 59.57863 56.19557
2022-03-25 92.09622 54.00683 66.53835 66.70615
2022-04-25 62.74961 58.89334 56.34485 68.11273
2022-05-25 88.70798 87.75809 65.69200 57.85879
2022-06-25 72.01018 79.88379 61.53902 95.35869
2022-07-25 94.04675 66.27377 75.93474 86.22792
2022-08-25 52.57497 91.33814 85.01536 73.57971
2022-09-25 51.21142 71.47336 53.46573 77.88881
2022-10-25 70.93977 65.64055 70.35639 55.68449
2022-11-25 80.27070 94.45147 75.94056 55.40697
2022-12-25 58.05967 74.95672 81.39050 50.77193
If you have daily data, for instance, you can select a single month by adding ["2022-03"]. This extracts all values for March 2022.
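The daily case can be sketched with a made-up series. The object names days, daily, march, until_jan10 and mid_feb are assumptions for this illustration, and the code requires the {xts} package to be installed.

```r
library(xts)

# 90 hypothetical daily observations starting on 1 January 2022.
days  <- seq(as.Date("2022-01-01"), by = "day", length.out = 90)
daily <- xts(1:90, order.by = days)

march       <- daily["2022-03"]                # every day in March 2022
until_jan10 <- daily["/2022-01-10"]            # everything up to 10 January
mid_feb     <- daily["2022-02-15/2022-02-20"]  # a closed date range

nrow(march)
nrow(until_jan10)
nrow(mid_feb)
```

The less specific the date string, the more observations it matches: a year selects all days in that year, a year and month all days in that month.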
Using first() and last(), you can extract the first x weeks of the dataset by passing "x weeks" to first(), and the last y months by passing "y months" to last(). Note that you can refer to weeks even if the periodicity of the dataset is monthly: R will extract all observations that fall within this x-week period. Valid periods are seconds, minutes, hours, days, weeks, months, quarters and years. For instance:
first(mat_dfts, "2 quarters")
var_1 var_2 var_3 var_4
2020-03-25 61.81532 71.34831 84.25773 87.37033
2020-04-25 83.67822 91.79865 79.53507 90.04232
2020-05-25 73.80289 58.45403 62.74126 71.43401
2020-06-25 62.37750 72.38072 71.67449 92.25952
last(mat_dfts, "2 quarters")
var_1 var_2 var_3 var_4
2024-01-25 91.99313 69.49864 60.66829 86.80344
2024-02-25 80.60366 63.13936 74.42565 97.02449
2024-03-25 81.18256 90.97987 52.96503 61.63764
2024-04-25 76.47097 99.46359 91.42821 96.14635
Combining first() and last():
first(last(mat_dfts, "4 quarters"), "3 months")
var_1 var_2 var_3 var_4
2023-07-25 60.92104 76.31153 93.02249 97.34942
2023-08-25 86.19620 78.71048 65.48694 94.27129
2023-09-25 80.78298 81.12048 95.89146 92.30111
Recall that mat_dfts includes a monthly time series. You can determine the endpoints for another time interval, e.g. quarter or year. Doing so, R selects the last observations per quarter or per year. In addition to year and quarter, you can also determine the endpoints for months, hours and minutes. Using these endpoints, you can extract the data for these moments.
Let’s first determine the endpoints per year (i.e. the last observations for a year):
end_year <- endpoints(mat_dfts, on = "year")
end_year
[1] 0 10 22 34 46 50
These observations are on the 10th row, the 22nd row, and so on. Using this vector to subset the time series allows you to extract the values of all variables in mat_dfts at these endpoints:
mat_dfts[end_year]
var_1 var_2 var_3 var_4
2020-12-25 50.25629 71.80983 54.86632 51.58079
2021-12-25 73.69827 60.67609 59.01273 80.17038
2022-12-25 58.05967 74.95672 81.39050 50.77193
2023-12-25 76.81282 78.46668 88.45526 82.58745
2024-04-25 76.47097 99.46359 91.42821 96.14635
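To see where the breakpoints 0, 10, 22, 34, 46 and 50 come from, the following base R sketch rebuilds them from a vector of 50 monthly dates starting on 2020-03-25. It only illustrates the idea and is not how endpoints() is implemented in {xts}; the names dates, years and ends are made up.

```r
# 50 monthly dates, matching the index of mat_dfts.
dates <- seq(as.Date("2020-03-25"), by = "month", length.out = 50)
years <- as.integer(format(dates, "%Y"))

# A year ends at every position where the next observation
# belongs to a different year; the last row always closes a period.
ends <- c(0L, which(diff(years) != 0), length(dates))
ends
```

The leading 0 marks the position just before the first observation, so that each pair of neighbouring values delimits one year.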
There are two special functions that allow you to extract the core data and the index. The first refers to all variables, other than the date/time index. To extract these variables, you use the coredata() function:
core <- coredata(mat_dfts)
head(core, n = 5)
var_1 var_2 var_3 var_4
[1,] 61.81532 71.34831 84.25773 87.37033
[2,] 83.67822 91.79865 79.53507 90.04232
[3,] 73.80289 58.45403 62.74126 71.43401
[4,] 62.37750 72.38072 71.67449 92.25952
[5,] 64.98585 65.08229 91.09376 85.55991
The index refers to the date/time index. Using the index() function allows you to extract these values:
datetime <- index(mat_dfts)
head(datetime, n = 5)
[1] "2020-03-25 CET" "2020-04-25 CEST" "2020-05-25 CEST" "2020-06-25 CEST"
[5] "2020-07-25 CEST"
Counting the number of months, quarters or years in a time series dataset can be done using nmonths(), nquarters() or nyears(). For instance, mat_dfts includes:
nmonths(mat_dfts)
[1] 50
50 months,
nquarters(mat_dfts)
[1] 18
18 quarters and
nyears(mat_dfts)
[1] 5
5 years of data.
Note that here, the first and last of these five years don’t include data for all 12 months of the year.
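The counts can be checked in base R by counting the distinct months, quarters and years in the dates. This is a sketch that reproduces the numbers for this particular index, not a description of how the {xts} functions work internally; the variable names are made up.

```r
# 50 monthly dates, matching the index of mat_dfts.
dates <- seq(as.Date("2020-03-25"), by = "month", length.out = 50)

n_months   <- length(unique(format(dates, "%Y-%m")))
n_quarters <- length(unique(paste(format(dates, "%Y"), quarters(dates))))
n_years    <- length(unique(format(dates, "%Y")))

c(months = n_months, quarters = n_quarters, years = n_years)

# Observations per calendar year: 2020 and 2024 are incomplete.
obs_per_year <- table(format(dates, "%Y"))
obs_per_year
```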
You can determine the periodicity (e.g. monthly, yearly, hourly) using periodicity(). The function estimates the frequency of the time series observations:
periodicity(mat_dfts)
Monthly periodicity from 2020-03-25 to 2024-04-25
In addition to the functions we have introduced for other data structures, there are a couple of functions specific to time series. The first function is lag(x, k). This function computes the lagged version of a time series. For instance, with k = 1, the lag of a monthly series shifts the observations by one month: the lagged value for March 2025 is the observation for February 2025. This allows you to compute the difference between two observations across time. The default value of k is 1. Changing this to e.g. 12 for a monthly series returns the value of the same variable 12 months earlier. Because there is no earlier observation for the first k rows, R fills these with NA. For instance, to determine the monthly change in all variables included in mat_dfts:
mat_lag1 <- mat_dfts - lag(mat_dfts, k = 1)
head(mat_lag1, 5)
var_1 var_2 var_3 var_4
2020-03-25 NA NA NA NA
2020-04-25 21.862903 20.450336 -4.722668 2.671986
2020-05-25 -9.875329 -33.344624 -16.793810 -18.608307
2020-06-25 -11.425395 13.926690 8.933233 20.825511
2020-07-25 2.608355 -7.298423 19.419271 -6.699614
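What the lag does can be sketched on a plain vector in base R; the numbers are made up. The second part is a caution: for base R ts objects, stats::lag() uses the opposite sign convention to the one {xts} applies, so a positive k moves the time stamps earlier instead of pairing each date with the previous value.

```r
x <- c(100, 110, 105, 120)

# "Previous value": shift everything one position down, pad with NA.
previous <- c(NA, head(x, -1))
change   <- x - previous
change

# Base R ts objects: the values stay the same,
# but the time stamps move one month earlier.
x_ts    <- ts(x, start = c(2025, 1), frequency = 12)
lead_ts <- stats::lag(x_ts, k = 1)
time(x_ts)[1]
time(lead_ts)[1]
```

When you mix ts and xts objects, it is worth checking on a small example which direction the lag takes.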
If you change k = 1 into k = 12, you calculate the change relative to the same month in the previous year. This is often referred to as the year-over-year (YoY) change:
mat_lag12 <- mat_dfts - lag(mat_dfts, k = 12)
last(mat_lag12, 5)
var_1 var_2 var_3 var_4
2023-12-25 18.753147 3.509959 7.0647632 31.815517
2024-01-25 7.487500 -28.240115 -4.3133763 36.497257
2024-02-25 -18.817803 -31.976606 0.2734934 16.111089
2024-03-25 7.618603 9.233648 -21.6129541 -34.376052
2024-04-25 6.163255 9.653780 29.0678414 -1.707843
Using diff(x, lag = 1, differences = 1) allows you to calculate similar differences. The lag = 1 argument specifies the lag and is similar to the k = 1 argument in the lag() function. The differences = 1 argument allows you to specify the order of the differencing. The first order (the default) calculates the difference between the levels. The second order calculates the difference of these differences (the discrete counterpart of a second derivative). To illustrate:
mat_dif1 <- diff(mat_dfts, lag = 1, differences = 1)
head(mat_dif1, 5)
var_1 var_2 var_3 var_4
2020-03-25 NA NA NA NA
2020-04-25 21.862903 20.450336 -4.722668 2.671986
2020-05-25 -9.875329 -33.344624 -16.793810 -18.608307
2020-06-25 -11.425395 13.926690 8.933233 20.825511
2020-07-25 2.608355 -7.298423 19.419271 -6.699614
calculates the same change as x - lag(x, k = 1). However,
mat_dif2 <- diff(mat_dfts, lag = 1, differences = 2)
head(mat_dif2, n = 5)
var_1 var_2 var_3 var_4
2020-03-25 NA NA NA NA
2020-04-25 NA NA NA NA
2020-05-25 -31.738232 -53.79496 -12.07114 -21.28029
2020-06-25 -1.550066 47.27131 25.72704 39.43382
2020-07-25 14.033750 -21.22511 10.48604 -27.52512
calculates the second-order difference: the change in the first difference from one period to the next.
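The relation between the two orders can be verified on a small made-up vector in base R. Note one difference with the xts output above: on a plain vector, diff() drops the leading observations instead of padding them with NA.

```r
x <- c(10, 13, 19, 28, 40)

d1 <- diff(x)                   # first differences: 3 6 9 12
d2 <- diff(x, differences = 2)  # differences of the differences: 3 3 3

d1
d2
identical(d2, diff(diff(x)))    # second order = differencing twice
```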
Recall the apply() function for matrices. The period.apply() function has a similar use for time series. The function requires an xts object, an index and a function. The index needs to define non-overlapping intervals. The endpoints() function is one way to specify these intervals. You can also specify your own vector, as long as it starts with 0, ends with the number of rows in the xts object and defines non-overlapping intervals. The period.apply() function will then apply a function to all observations within an interval. For instance, recall that endpoints() returns a vector with index breakpoints:
end_year <- endpoints(mat_dfts, on = "years")
end_year
[1] 0 10 22 34 46 50
The first interval runs from the 1st to the 10th observation, the second yearly interval from the 11th to the 22nd observation, and so on. You can now use period.apply() to calculate e.g. the mean for every year:
period.apply(mat_dfts, INDEX = end_year, FUN = colMeans)
var_1 var_2 var_3 var_4
2020-12-25 63.62546 71.06288 77.21998 81.80398
2021-12-25 75.99454 74.77712 72.00702 77.25454
2022-12-25 71.36631 76.34164 70.05446 69.62891
2023-12-25 78.67123 82.97227 78.22946 75.37588
2024-04-25 82.56258 80.77036 69.87180 85.40298
As you can see, this code returns the mean value per year for all 4 variables. If it makes more sense to calculate the sum, you can use colSums instead:
period.apply(mat_dfts, end_year, colSums)
var_1 var_2 var_3 var_4
2020-12-25 636.2546 710.6288 772.1998 818.0398
2021-12-25 911.9344 897.3254 864.0843 927.0545
2022-12-25 856.3958 916.0997 840.6536 835.5469
2023-12-25 944.0548 995.6672 938.7536 904.5106
2024-04-25 330.2503 323.0815 279.4872 341.6119
More generally, for every non-overlapping period in INDEX, period.apply() applies the function in FUN. The index is a vector with positions that show the end point of every interval. For instance, c(0, 3, 6, 9) defines intervals covering observations 1, 2 and 3, observations 4, 5 and 6, and observations 7, 8 and 9. For each of these three intervals, R then applies the function in FUN. If this function is colMeans, R calculates, for every variable in the dataset, the mean of the three observations in each interval; with colSums, it calculates, for every variable, the sum of the three observations in each interval.
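The interval logic for c(0, 3, 6, 9) can be sketched in base R on an ordinary matrix. This is only an illustration of the idea, not the implementation used by period.apply(); the matrix x and the names idx and sums are made up.

```r
# A 9 x 2 matrix with hypothetical values: a = 1..9, b = 10..18.
x <- matrix(1:18, nrow = 9, ncol = 2,
            dimnames = list(NULL, c("a", "b")))
idx <- c(0, 3, 6, 9)

# For interval i, take the rows after idx[i] up to idx[i + 1]
# and apply colSums to that block.
sums <- t(sapply(seq_len(length(idx) - 1), function(i) {
  rows <- (idx[i] + 1):idx[i + 1]
  colSums(x[rows, , drop = FALSE])
}))
sums
```

Each row of the result corresponds to one interval, each column to one variable.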
Make sure that the {xts} package is loaded.
Create a 104x2 matrix data with column names high and low and values runif(104, 100, 200) and runif(104, 10, 20):
data <- matrix(c(runif(104, 100, 200), runif(104, 10, 20)), 104, 2)
colnames(data) <- c("high", "low")
Add a weekly time sequence starting 2023-01-01 with 104 weeks, assign it to weeks and add this variable to the data matrix:
weeks <- seq.POSIXt(from = as.POSIXct("2023-01-01", format = "%Y-%m-%d", tz = "UTC"), length.out = 104, by = "weeks")
data <- cbind(weeks, data)
Add both to an xts object datats and remove the weeks column:
datats <- as.xts(data, order.by = as.POSIXct(data[, 1]))
datats <- datats[, -1]
Determine the periodicity of datats as well as the number of months and years:
periodicity(datats)
Weekly periodicity from 2023-01-01 01:00:00 to 2024-12-22 01:00:00
nmonths(datats)
[1] 24
nyears(datats)
[1] 2
Determine the quarterly endpoints:
end_quar <- endpoints(datats, on = "quarter")
Use the period.apply() function to calculate the sum per quarter of the variables in datats. Store the results in datatsq:
datatsq <- period.apply(datats, end_quar, colSums)
datatsq
high low
2023-03-26 01:00:00 1873.203 185.6338
2023-06-25 02:00:00 2058.894 211.4065
2023-09-24 02:00:00 1871.825 197.9169
2023-12-31 01:00:00 2219.496 180.3415
2024-03-31 01:00:00 1924.175 191.3454
2024-06-30 02:00:00 2034.071 178.1488
2024-09-29 02:00:00 1901.548 199.3277
2024-12-22 01:00:00 1931.220 163.2491
Calculate the week-over-week difference for the variables in datats and store the results in diff_datats:
diff_datats <- diff(datats, lag = 1, differences = 1)
Use lag() to calculate the percentage change in high and store as pct_high:
pct_high <- (datats$high - lag(datats$high))/lag(datats$high)
A data.table is an enhanced data.frame. This data structure allows you to e.g. query data inside the table using an SQL-like syntax. To use this data structure, you need to install and load the {data.table} package. To do so, you first install the package (if you haven’t done so yet)
install.packages("data.table")
and load the package
library(data.table)
Warning: package 'data.table' was built under R version 4.4.3
Attaching package: 'data.table'
The following objects are masked from 'package:xts':
first, last
The following objects are masked from 'package:zoo':
yearmon, yearqtr
Here, we will not cover data.tables in depth, but give a couple of examples of how a data.table differs from the traditional data.frame. These examples will show why a data.table is usually faster than a data.frame, especially on large datasets. If you need to work with very large datasets, you can use e.g. Barrett et al. (2025) as a starting point for an introduction to this data structure.
Let’s first create a data.table. You’ll see that the basic syntax is comparable to the usual data.frame() syntax:
dt <- data.table(
firm = LETTERS[1:25],
sales = runif(25, 100, 200),
margin = rnorm(25, 10, 2),
sector = sample(c("services", "services", "industry", "construction", "transport"), 25, replace = TRUE))
head(dt, 10)
firm sales margin sector
<char> <num> <num> <char>
1: A 196.4297 11.993541 transport
2: B 136.2853 9.378965 transport
3: C 111.2189 13.642239 transport
4: D 135.9506 9.769425 services
5: E 177.5073 11.911084 services
6: F 106.0136 8.349874 services
7: G 121.1611 13.540675 services
8: H 195.2492 11.021530 services
9: I 196.5682 10.317356 services
10: J 191.2377 9.823625 services
This function returns a data.table. As a data.table is an enhanced version of a data.frame, it is also a data.frame.
class(dt)
[1] "data.table" "data.frame"
In other words, you can apply data.frame functions and subsetting rules to data.tables:
head(dt[, 1], 5)
firm
<char>
1: A
2: B
3: C
4: D
5: E
dt$sales
[1] 196.4297 136.2853 111.2189 135.9506 177.5073 106.0136 121.1611 195.2492
[9] 196.5682 191.2377 190.2272 156.6190 145.9440 113.6654 174.5056 198.7093
[17] 188.1009 167.9208 156.7068 173.0615 151.7091 144.0244 106.2058 113.5574
[25] 174.3352
dt[["margin"]]
[1] 11.993541 9.378965 13.642239 9.769425 11.911084 8.349874 13.540675
[8] 11.021530 10.317356 9.823625 8.171818 10.430808 12.485047 13.928489
[15] 9.368508 6.896699 9.374375 8.561754 11.633696 11.025556 11.621973
[22] 9.256063 8.135490 10.406770 7.344467
head(dt[sales > 150, 1:3], 5)
firm sales margin
<char> <num> <num>
1: A 196.4297 11.993541
2: E 177.5073 11.911084
3: H 195.2492 11.021530
4: I 196.5682 10.317356
5: J 191.2377 9.823625
Of these 4 subsetting operations, those using the preserving operator ([]) return a data.table and those using a simplifying operator ([[]] or $) return a vector.
A data.table has names (columns) and row names (the rows are numbered in this case and these numbers are stored as $row.names). The attributes further include the class as well as the location in memory where R stores the data.table.
attributes(dt)
$names
[1] "firm" "sales" "margin" "sector"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$class
[1] "data.table" "data.frame"
$.internal.selfref
<pointer: 0x0000018f5a429cd0>
However, in addition to the subsetting rules for data.frames, data.tables allow you to subset observations using dt[i, j, by] where i are the rows to subset or reorder, j refers to a calculation and by refers to a group. Let’s start with i and extract only firms that are in “industry” or in “transport”:
dt_ind <- dt[sector == "industry" | sector == "transport"]
dt_ind
firm sales margin sector
<char> <num> <num> <char>
1: A 196.4297 11.993541 transport
2: B 136.2853 9.378965 transport
3: C 111.2189 13.642239 transport
4: K 190.2272 8.171818 transport
5: M 145.9440 12.485047 transport
6: Q 188.1009 9.374375 industry
7: T 173.0615 11.025556 industry
8: W 106.2058 8.135490 industry
9: Y 174.3352 7.344467 transport
As you would expect, R subsets the data.table and extracts only those observations where the boolean expression sector == "industry" | sector == "transport" is TRUE, and skips all other observations, for which the expression returns FALSE. Here, you use subsetting rules that you know from the data.frame section.
Let’s now add a calculation within the subsetting and ask R to calculate the sum, mean, minimum and maximum of sales for these two sectors. To do so, we use the j position in dt[]: the i position is used to select the sectors and the j position is used to include a calculation. As we have more than one calculation (sum, mean, min and max), we wrap them in .(), i.e. parentheses preceded by a dot:
dt_ind_sum <- dt[sector == "industry" | sector == "transport", .(sum(sales), mean(sales), min(sales), max(sales))]
dt_ind_sum
V1 V2 V3 V4
<num> <num> <num> <num>
1: 1421.809 157.9787 106.2058 196.4297
We now have a data.table with the sum, the mean, the minimum and the maximum of sales for these two sectors. As the calculations were not named, R labels the columns V1 to V4. Note that in this case, subsetting for data.tables and data.frames is different: within a data.frame, you cannot add calculations inside the subsetting operator.
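The default labels V1 to V4 are not very informative. Inside .() you can name each calculation, as in the sketch below. The small table dt_small and the column names total, average, lowest and highest are made up for this illustration, and the code requires the {data.table} package to be installed.

```r
library(data.table)

# A small hypothetical table with four firms.
dt_small <- data.table(
  firm   = c("A", "B", "C", "D"),
  sales  = c(100, 150, 200, 250),
  sector = c("industry", "industry", "transport", "services"))

# i selects the rows, j computes named summary statistics.
res <- dt_small[sector == "industry" | sector == "transport",
                .(total   = sum(sales),
                  average = mean(sales),
                  lowest  = min(sales),
                  highest = max(sales))]
res
```

The row filter could also be written as sector %in% c("industry", "transport"), which is easier to extend when more sectors are involved.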
To calculate these for each industry, we can now use the by position. If we include by = "sector", R will calculate the sum, mean, min and max for each sector.
dt_ind_sum <- dt[sector == "industry" | sector == "transport", .(sum(sales), mean(sales), min(sales), max(sales)), by = "sector"]
dt_ind_sum
sector V1 V2 V3 V4
<char> <num> <num> <num> <num>
1: transport 954.4403 159.0734 111.2189 196.4297
2: industry 467.3682 155.7894 106.2058 188.1009
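Here, R reads the dt[] call as: using only the firms in industry or transport, calculate the sum, mean, min and max of sales for every distinct value of sector. Here, the only distinct values of sector are “industry” and “transport”. If you change the filter in i, e.g. to firms with a margin above 7.499, firms from all four sectors remain: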
dt_ind_sum <- dt[margin > 7.499, .(sum(sales), mean(sales), min(sales), max(sales)), by = "sector"]
dt_ind_sum
sector V1 V2 V3 V4
<char> <num> <num> <num> <num>
1: transport 780.1051 156.0210 111.2189 196.4297
2: services 2032.0241 156.3095 106.0136 196.5682
3: construction 270.3722 135.1861 113.6654 156.7068
4: industry 467.3682 155.7894 106.2058 188.1009
A data.table further allows you to create a new variable for all observations. To do so, you use the j position and create the new variable as a function of the other variables: you write the name of the new variable, followed by := and the expression R needs to evaluate. For instance, to generate a variable gross_profit as the product of sales and margin (where you divide margin by 100 because it is expressed as a percentage):
dt[, gross_profit := sales * (margin/100), ]
head(dt, n = 5)
firm sales margin sector gross_profit
<char> <num> <num> <char> <num>
1: A 196.4297 11.993541 transport 23.55887
2: B 136.2853 9.378965 transport 12.78215
3: C 111.2189 13.642239 transport 15.17274
4: D 135.9506 9.769425 services 13.28159
5: E 177.5073 11.911084 services 21.14305
Because operations such as creating a new variable or calculating a sum, mean, min or max run within the subsetting call, a data.table is usually faster than a data.frame, especially on large datasets. With a data.frame, most of the results shown here would require R to call additional functions or functions in other packages. Doing so slows down the process.
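The := operator can be combined with by to create a variable per group. The sketch below uses a small made-up table dt_small and a hypothetical variable sector_share; it requires the {data.table} package to be installed. Note that := changes the data.table in place, so there is no need to assign the result to a new object.

```r
library(data.table)

# A small hypothetical table: two firms in each of two sectors.
dt_small <- data.table(
  firm   = c("A", "B", "C", "D"),
  sales  = c(100, 300, 50, 150),
  sector = c("industry", "industry", "transport", "transport"))

# Share of each firm in the total sales of its own sector.
dt_small[, sector_share := sales / sum(sales), by = sector]
dt_small
```

Because sum(sales) is evaluated within each sector, the shares add up to 1 per sector and not over the whole table.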